Preface



DISC, the International Symposium on Distributed Computing, is an annual 
forum for research presentations on all facets of distributed computing. This 
volume includes 23 contributed papers and an invited lecture, all presented at 
DISC ’99, held on September 27-29, 1999 in Bratislava, Slovak Republic. 

In addition to regular submissions, the call for papers for DISC ’99 also so- 
licited Brief Announcements (BAs). We received 60 regular submissions and 15 
brief announcement submissions. These were read and evaluated by the pro- 
gram committee, with the additional help of external reviewers when needed. At 
the program committee meeting on June 10-11 at Dartmouth College, Hanover, 
USA, 23 regular submissions and 4 BAs were selected for presentation at 
DISC ’99. The extended abstracts of these 23 regular papers appear in this 
volume, while the four BAs appear as a special publication of Comenius Univer- 
sity, Bratislava - the host of DISC ’99. It is expected that the regular papers will 
be submitted later, in more polished form, to fully refereed scientific journals. 

Of the 23 regular papers selected for the conference, 12 qualified for the 
Best Student Paper award. The program committee awarded this honor to the 
paper entitled “Revisiting the Weakest Failure Detector for Uniform Reliable 
Broadcast” by Marcos Aguilera, Sam Toueg, and Borislav Deianov. Marcos and 
Borislav, who are both students, share this award. 

DISC ’99 hosted three invited lectures, by Tushar Chandra (IBM), Faith Fich 
(University of Toronto), and Michael Fischer (Yale University). The extended 
abstract of Tushar Chandra’s lecture appears in this volume. The extended ab- 
stracts of the other two invited lectures are expected to appear in next year’s 
DISC proceedings. 

The program committee for DISC ’99 consisted of: 



Angel Alvarez (Tech. U. of Madrid)
Anindya Basu (Bell Labs)
Shlomi Dolev (Ben-Gurion)
Cynthia Dwork (IBM)
Rachid Guerraoui (E. Polytechnique)
Vassos Hadzilacos (U. Toronto)
Maurice Herlihy (Brown)
Prasad Jayanti, Chair (Dartmouth)
Srinivasan Keshav (Cornell)
Marios Mavronicolas (U. Connecticut)
Yoram Moses (Technion)
Alessandro Panconesi (U. Bologna)
Mike Reiter (Bell Labs)
Sam Toueg (Cornell)
Moti Yung (CertCo)



Peter Ruzicka was the Local Arrangements Chair for the conference. Miroslav 
Chladny and Rastislav Kralovic took charge of advertising and maintaining the 
web site for DISC ’99. We thank all of them for a superb job. 

We thank all the authors who submitted to DISC ’99 and the keynote 
speakers for accepting our invitation. We thank ACM SIGACT for letting us 
use its automated submission server and the electronic program committee 
(EPC) server. We are especially grateful to Steve Tate and Peter Shor for their
help with the submission server and the EPC server, respectively. We thank 
Springer-Verlag for continuing the tradition and publishing these proceedings.

We gratefully acknowledge the generous support of the Slovak Society for 
Computer Science, Telenor, Comenius University, and Dartmouth College. The 
scope and the direction of DISC are supervised by the DISC steering commit- 
tee which, for 1999, consisted of Bernadette Charron-Bost (Ecole Polytechnique), 
Vassos Hadzilacos (University of Toronto), Prasad Jayanti (Dartmouth College), 
Shay Kutten (Technion and IBM), Marios Mavronicolas (University of Connecticut),
Andre Schiper - Vice Chair (Ecole Polytechnique, Lausanne), and Shmuel
Zaks - Chair (Technion). 



September 1999 



Prasad Jayanti 




External Reviewers 



The Program Committee thanks the following persons who acted as referees for 
DISC ’99; 
A Case for Message Oriented Middleware



Guruduth Banavar, Tushar Chandra, Robert Strom, and Daniel Sturman 

IBM T. J. Watson Research Center, Hawthorne, New York 
{banavar,tushar,strom,sturman}@watson.ibm.com



Abstract. With the emergence of the internet, independent applica- 
tions are starting to be integrated with each other. This has created a 
need for technology for glueing together applications both within and 
across organizations, without having to re-engineer individual compo- 
nents. We propose an approach for developing this glue technology based 
on message flows and discuss the open research problems in realizing this 
approach. 



1 Introduction 

Recent advances in networking and the pervasive deployment of the internet have 
created new opportunities for computing research. Many applications are evolv- 
ing from monolithic to distributed systems. Business processes are increasingly 
being automated and interconnected in spontaneous ways. Companies increas- 
ingly require the integration of once independent applications, either because 
they are vertically integrating components of the business, or because of merg- 
ers or outsourcing of function to separate organizations. In summary, there is 
a convergence to loosely integrated distributed systems, where each component 
can evolve independently. 

In this paper, we make the case for research in “glue technology” for loosely 
integrating distributed systems. The basic observation underlying this technol- 
ogy is that widely disseminated, often real-time “events” (or messages) — e.g. 
stock quotations, advertisements, offers to buy and sell, weather reports, traf- 
fic conditions, etc. — are becoming ubiquitously available through the Internet. 
These public events, as well as internal events, such as orders, shipments, de- 
liveries, manufacturing line changes, can form the “glue” to link applications 
within and across organizations. Since this technology is based on messages, we 
use the term Message Oriented Middleware, or MOM for short, to refer to it. 

Consider the opportunities that arise within a single company. Today, most 
large companies consist of several smaller organizations that are supported by 
applications that were developed and have evolved largely independently of each 
other. In most cases, people provide the linkage between these smaller organiza- 
tions. For example, when a sale occurs, the sales representative sends a request 
to manufacturing to produce the product. In such cases, what is needed is a
technology that allows companies to automate such processes by tying together their
organizations’ independent applications. Once tied together, the integrated com- 
puting system is potentially cheaper to operate, faster and less prone to errors 
than the system in which humans glued together independent organizations.
Further, this kind of integration opens up new opportunities for macroscopic 
analysis such as forecasting, and requirements analysis across the company that 
can further streamline operations. 

The value of tying together applications can easily be extended to partner- 
ships between companies. One obvious scenario is when two companies merge. 
Usually each company has its own set of independent applications — the chal- 
lenge is to provide “computing glue” to tie these applications together. Another 
obvious scenario is when two companies do business with each other. In today’s 
model, people are heavily involved in transacting business between companies. 
Business integration offers the opportunity to automate business interactions 
between companies by tying together their applications. 

As more consumers shop, invest, pay bills and taxes, read, and perform co- 
operative work and play online, the same glue technology may be extended to 
support business-to-consumer interactions. Better integration between business 
software and end-user software such as web browsers offers the promise of more 
power and greater ease of use to the end-user. 

We propose an approach for developing the glue technology required for mes- 
sage oriented middleware (MOM). Our proposed approach is derived from the 
publish and subscribe model [OPSS93] and event delivery systems [WIS]. In 
this model, the basic unit of data is a message, which corresponds to what are 
called “events” in publish/subscribe or event delivery systems. Clients register 
as publishers or subscribers of messages. Messages are sent to and delivered from 
information spaces, each of which has a predefined message schema [BKS+99].
Information spaces may be tied together via message flow graphs that spec- 
ify how messages are propagated and transformed between producers and con- 
sumers. A message flow graph may route a filtered subset of messages from one 
information space to another, merge messages from multiple sources, or deliver 
a transformed version of a message from one information space to another. Some 
information spaces contain states summarizing the message history of other in- 
formation spaces. This state can be re-mapped back into a message sequence, 
often in more than one way. Systems can exploit this non-determinism by re- 
laxing the ordering of messages, by dropping obsolete messages, by compressing 
the past history being sent to a newly connecting subscriber or to a subscriber 
who has reconnected after being off-line. 

In the near future, we envision a pervasive MOM environment that glues to- 
gether a large number of stand-alone applications. Each application may evolve 
independently from the others in this environment. The MOM environment will 
support such evolution without requiring changes in other applications, and in 
fact, without requiring the other applications to be aware of the addition and 
removal of applications and clients. (The only exceptions would be those appli- 
cations dealing with access control and security). The MOM environment will 
allow new applications to “tap into” information generated by existing applica- 
tions without disturbing them. This will allow users to add higher order features 
such as auditing, monitoring, and data mining on top of existing information
flows, after the fact, and without disrupting the underlying applications. 

Several crucial research problems remain unsolved in the MOM approach, and 
even those that are solved have not been completely implemented yet. A complete 
and precise model for this approach has not yet emerged. Several key distributed 
computing problems remain open such as scalability, and how to provide end-to- 
end guarantees on message delivery. Some of the known algorithms tackle subsets 
of the overall problem. Questions related to fault-tolerance, security, message 
ordering, and topology changes that have been well studied in the context of 
other types of messaging systems are open areas for further research in the 
context of MOM. 

There are several efforts from various other communities to provide glue tech- 
nology to tie together applications, and it is not clear at this stage whether a 
single glue technology is best suited for all environments. The database commu- 
nity has extended the classical ideas underlying databases to distributed environ- 
ments via distributed transactions [TGGL82] and federated databases [Hsi92]. 
The languages community has extended the concept of objects to a distributed 
environment via remote method invocation (CORBA [Gro98], RMI [BN83], etc.).
Group communication systems such as Isis have also been used to glue together 
applications in a distributed environment [BCJ+90,Bir93,Pow96]. Finally, there
exist higher-level approaches such as Workflow that are targeted towards specific 
subsets of the overall problem [WfM,PPHG98]. We will compare our proposed 
MOM approach with these other approaches. 

The rest of this paper is organized as follows. In section 2 we examine sev- 
eral examples that motivate the need for message oriented middleware. In sec- 
tion 3 we elaborate on the message oriented middleware model. In section 4 we 
discuss open areas for further research in this area. In Section 5 we examine 
alternative approaches. Section 5 concludes with work related to content-based 
publish/subscribe systems. 

2 Examples 

In this section, we provide two examples that motivate the need for Message 
Oriented Middleware and illustrate the various requirements imposed by appli- 
cations: 

1. Stock Trading. This example illustrates the need for anonymity between
message publishers and message subscribers. In addition, it demonstrates the
need for MOM-based systems to be highly scalable. It also demonstrates 
the need for MOM-based deployments that cross organizational boundaries. 
Finally, this example motivates the need for on-the-fly transformation of 
messages into formats suitable for different clients. 

2. Home Shopping. In addition to anonymity between publishers and sub- 
scribers, this example illustrates the need for interpreting message streams 
to capture application-specific meaning. This example also shows the need 
for near real-time response from the MOM environment. 
2.1 Stock Trading

To illustrate how MOM can support application integration, consider a publish- 
subscribe based stock trading application written for a particular stock exchange, 
say the New York Stock Exchange. In such an application, stock trades, bids, 
and sales are published as messages. Brokers affiliated with the NYSE access this 
information by subscribing to events of particular interest to them. Stock trade 
events are published in a format beginning with the following four attributes: 
(1) NYSE ticker symbol, (2) share price, (3) share volume, (4) broker id. A 
similar application running at another stock exchange, say the NASDAQ, may 
also publish events corresponding to trades, but with a different format for the 
first four attributes: (1) NASDAQ ticker symbol, (2) share price, (3) capital 
in this trade, (4) broker code. For both markets, message rates are very high,
and there are large numbers of publishers and subscribers. Thus, an important 
requirement for MOM is performance and scalability. 



Fig. 1. Applications using both NASDAQ and NYSE to exchange data. 



Now let us suppose that brokers and analysts previously dealt separately 
with both the NYSE and NASDAQ exchanges, and that in future they wish 
to run the same analysis programs for trades on both exchanges. In order to 
run their internal analyses on information from both sources, they may wish
to convert the data into a common internal format, and merge the two
information streams into a single stream. For instance, Figure 1 shows both stock
trade formats converted to a unified format consisting of the following first four 
attributes: (1) unique company name, (2) share price, (3) share volume, and 
(4) unique broker name. The MOM system should thus support transforming 
those messages it is delivering to clients requesting the common format while 
preserving the original format for legacy applications. Also, legacy NYSE client
applications may wish to access NASDAQ trade events converted into the NYSE format
and vice versa. That is, MOM-based transformation should enable clients ac- 
cessing both stock exchanges to access this integrated data without disrupting 
the operation of legacy applications. 
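
The conversion to the unified format can be sketched as follows. This is a minimal illustration and not the Gryphon implementation: the lookup tables, their entries, and the names UnifiedTrade, from_nyse, from_nasdaq and merged_stream are all hypothetical. The only rule taken from the formats above is that a NASDAQ event carries the capital in the trade rather than the share volume, so the volume has to be derived from capital and price.

```python
from dataclasses import dataclass

# Hypothetical lookup tables mapping exchange-specific identifiers to unified
# names; every entry here is invented for illustration.
NYSE_COMPANIES = {"XYZ": "XYZ Corporation"}
NASDAQ_COMPANIES = {"XYZQ": "XYZ Corporation"}
NYSE_BROKERS = {"17": "Broker A"}
NASDAQ_BROKERS = {"B-9": "Broker B"}


@dataclass(frozen=True)
class UnifiedTrade:
    company: str   # (1) unique company name
    price: float   # (2) share price
    volume: float  # (3) share volume
    broker: str    # (4) unique broker name


def from_nyse(ticker, price, volume, broker_id):
    """NYSE events already carry share volume; only the names are remapped."""
    return UnifiedTrade(NYSE_COMPANIES[ticker], price, volume, NYSE_BROKERS[broker_id])


def from_nasdaq(ticker, price, capital, broker_code):
    """NASDAQ events carry capital, so volume is derived as capital / price."""
    return UnifiedTrade(
        NASDAQ_COMPANIES[ticker], price, capital / price, NASDAQ_BROKERS[broker_code]
    )


def merged_stream(nyse_events, nasdaq_events):
    """Convert both streams and merge them (here simply NYSE first, then NASDAQ)."""
    return [from_nyse(*e) for e in nyse_events] + [from_nasdaq(*e) for e in nasdaq_events]
```

Legacy clients would keep reading the untouched source streams; only clients that ask for the common format would see UnifiedTrade messages.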

To extend the system to new applications, such as direct customer trading, it 
is simply necessary to “tap in” to the message streams. Provided that the infras- 
tructure can scale from hundreds of brokers to hundreds of thousands of on-line 
investors, each investor can specify an appropriate selection of interest — such as 
the issues in his portfolio. Of course the applications used by customers and by 
professional stockbrokers will be very different, but the message stream “glue” 
will remain the same. This example illustrates both the importance of anonymity 
between producers and consumers, and the need to cross organizational barriers. 
Applications posting events need not be aware of their destinations; applications 
subscribing to events need not be aware of their sources. Extending the system 
to bring stock trade events directly to home computers may change the system 
load, but not the fundamental architecture. 



2.2 Home Shopping 

As a second example of the use of MOM, consider a home shopping application 
where consumers may request up-to-the-minute information and pricing on re- 
tail items from “virtual markets” for products such as automobiles, computers, 
or camera equipment. Each message represents a seller’s ad: the seller’s identity 
and location, the type of article, and attribute-value pairs representing the
attributes of the article being sold. For example, automobile advertisements would
include the make, model, year, mileage, and options. The price might either be 
fixed, or left open to competitive bidding in a real-time auction. Additional
messages represent bids by buyers, and the opening and closing of auctions. Consumers
subscribe through a number of tests on these attributes. As in the previous ex- 
ample, messages must be routed to some subset of all subscribers based on their 
information content. 

An important function that MOM must support for this scenario is the com- 
munication of dynamic changes in the availability and the price of items. The 
seller may lower the price or withdraw his ad in response to lack of interest by 
customers. Or buyers may raise the price as a result of competitive bidding. 
Typically a buyer would subscribe not just to a stream of events, but to a state
determined by the message history, such as the current price of items matching 
his criterion. As a result, new subscribers would not receive all messages from 
Fig. 2. Suppressing and delivering advertisements in a home shopping scenario.



the beginning of time but instead only those messages representing the current 
valid prices for these selected items. Messages which have become superseded by 
updates or which correspond to items no longer available for sale need not be 
delivered. This example illustrates the importance of interpreting event histories 
as a state — specifying such a summarization requires the system to understand 
that a new price for the same item supersedes the previous price, and that the 
termination of an auction or the withdrawal of an ad cancels all previous prices 
for that item. 

The ability to subscribe to a state summarizing a message history has addi- 
tional effects on message delivery if the buyer wants to subscribe not merely to 
the current offer for each selected item, but instead wants to track, for instance,
the lowest-priced ad matching his criteria, plus any items for which the buyer 
is still the high bidder. In this case, messages which do not impact the lowest 
price because they are ads for articles with higher prices are not delivered. Note 
that if the low-priced ad is modified or withdrawn, or if its price is bid up in 
an auction so that it is no longer the low-priced ad, then it may be necessary 
to either deliver the ad for the second-low-priced item (if it had previously been 
suppressed) or to redeliver it. For example, consider the auto advertisements in
Figure 2. The $20,000 price is the low price, so the $30,000 price is not delivered. 
However, once the $20,000 price is withdrawn or raised above $30,000, the ad 
for the $30,000 automobile is delivered to replace it. 
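
The suppression behaviour in this example can be sketched as follows. This is an illustrative sketch under simplifying assumptions, not a description of an existing system: the class name LowestPriceSubscription, the ad identifiers, and the convention that a price of None means "withdrawn" are all hypothetical, and the sketch tracks only the lowest-priced ad, not the items for which the buyer is the high bidder.

```python
def lowest_priced_ad(prices):
    """Return (ad_id, price) for the cheapest live ad, or None if no ad is live."""
    if not prices:
        return None
    ad_id = min(prices, key=lambda k: (prices[k], k))  # ties broken by ad id
    return ad_id, prices[ad_id]


class LowestPriceSubscription:
    """Tracks live ads and reports only changes to the lowest-priced one."""

    def __init__(self):
        self._prices = {}       # live ads: ad id -> current price
        self._delivered = None  # the (ad_id, price) last delivered to the buyer

    def on_message(self, ad_id, price):
        """Apply one ad message; a price of None withdraws the ad.

        Returns the ad to deliver, or None when the message is suppressed
        (or when no live ad remains).
        """
        if price is None:
            self._prices.pop(ad_id, None)
        else:
            self._prices[ad_id] = price
        best = lowest_priced_ad(self._prices)
        if best == self._delivered:
            return None
        self._delivered = best
        return best


sub = LowestPriceSubscription()
first = sub.on_message("ad-1", 20000)   # delivered: it is the low price
second = sub.on_message("ad-2", 30000)  # suppressed: not the low price
third = sub.on_message("ad-1", None)    # withdrawal exposes the $30,000 ad
```

Raising the price of ad-1 above $30,000 instead of withdrawing it would have the same effect on the third call: the previously suppressed ad becomes the one delivered.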

Notice that, in this example, there is no real organizational boundary. Anyone 
might register as a publisher (potential seller) or as a subscriber (potential buyer) 
of an information space for a particular product. 

3 A Flow-Graph Model for Message Oriented Middleware 

For supporting the examples in the previous section, we desire a model which: 
(1) facilitates expression of clients’ requirements; (2) is easy to reason about, 
both for validating specifications and implementations; (3) is rigorous, and (4) 
permits the widest possible latitude for the implementor. 

Good models are founded upon a small number of basic concepts. For exam- 
ple, the key concepts underlying transactional database models are atomicity and 
serializability. The data, whether distributed and/or replicated, whether concur- 
rently accessed or not, behave as if located at a single site and accessed one trans- 
action at a time. Similarly, the key concepts underlying the Isis model [BCJ+90] 
are group membership [Bir93,Pow96] and virtual synchrony [BT87]. Once a 
model is chosen, we have a rigorous requirement for developing clever imple- 
mentations for preserving the appearances of the model under a wide variety of 
design points and physical environments. 

The Gryphon project at IBM Research is exploring a particular model for 
MOM based upon the concepts of information spaces and message flows 
(http://www.research.ibm.com/gryphon). A system is modeled as a message 
flow graph: an abstract representation of the propagation of the messages in a 
system, divorced from any realization on an actual network topology [BKS+99]. 
A message flow graph is a directed acyclic graph whose nodes are information 
spaces and whose arcs are message flows. Information spaces are either totally 
ordered message histories or states derived from message histories. Each infor- 
mation space has a schema that specifies the type of the messages or of the 
state. Publishers and subscribers are source and sink message histories respec- 
tively. Message flows specify the propagation of messages or the updating of 
state, in order to preserve specific relationships between information spaces as 
new messages are added to the system. These relationships include: 

— selection: the destination message history receives a copy of the subset of the 
source message history that satisfies some boolean predicate P. For example, 
an analyst may request all trades of more than 1000 shares of XYZ company 
having a price greater than $40 per share. 

— transform: the destination message history receives a copy of each message 
of the source message history after applying a transform T. For example, the 
conversion “Volume = Capital/Price” might be used to convert all messages
from the NASDAQ format to the NYSE format. 
— merge: when multiple message histories have arcs to a single destination,
the destination receives a merge (in a non-deterministic order) of all
the messages of the sources. This operation is involved in any application 
involving multiple publishers. 

— collapse: the destination state is computed by applying some summarization
function S over all the messages in the source message history. In the home 
shopping example, a client may be interested in the lowest priced ad for Saab 
cars with less than 80,000 miles and a price under $6000. 

— expand: expand is the inverse of collapse. The destination message history
is (non-deterministically) computed to be any message history which sum- 
marizes to the source state under the summarization function S. All such 
message histories are said to be equivalent — the system is free to choose 
which one of the many message histories to deliver to a destination. 
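
The five relationships can be illustrated with a small sketch over in-memory lists. It is only an aid to intuition: the function names, the message layouts, and the particular summarization function current_prices are assumptions, and merge picks one admissible order (round-robin interleaving) where the model allows any.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel used while interleaving histories of unequal length


def select(history, predicate):
    """selection: keep the messages that satisfy the predicate P."""
    return [m for m in history if predicate(m)]


def transform(history, fn):
    """transform: apply the transform T to every message."""
    return [fn(m) for m in history]


def merge(*histories):
    """merge: one admissible order (round-robin); the model permits any order."""
    merged = []
    for group in zip_longest(*histories, fillvalue=_MISSING):
        merged.extend(m for m in group if m is not _MISSING)
    return merged


def collapse(history, summarize):
    """collapse: compute a state by applying the summarization function S."""
    return summarize(history)


def current_prices(history):
    """An example S: latest price per item; a price of None withdraws the item."""
    state = {}
    for item, price in history:
        if price is None:
            state.pop(item, None)
        else:
            state[item] = price
    return state


def expand_prices(state):
    """expand for this S: one of many histories that summarize to the state."""
    return sorted(state.items())


trades = [
    {"symbol": "XYZ", "price": 41.0, "volume": 1500},
    {"symbol": "XYZ", "price": 39.0, "volume": 2000},
    {"symbol": "ABC", "price": 50.0, "volume": 5000},
]
large_xyz = select(
    trades, lambda m: m["symbol"] == "XYZ" and m["volume"] > 1000 and m["price"] > 40
)

nasdaq = [{"symbol": "XYZ", "price": 50.0, "capital": 5000.0}]
as_nyse = transform(
    nasdaq,
    lambda m: {"symbol": m["symbol"], "price": m["price"], "volume": m["capital"] / m["price"]},
)

ads = [("saab-1", 5900), ("saab-2", 5500), ("saab-1", 5700), ("saab-2", None)]
state = collapse(ads, current_prices)
equivalent = expand_prices(state)
```

Here the four-message history ads and the one-message history equivalent collapse to the same state, which is the sense in which the two histories are equivalent.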

Message flow graphs can evolve dynamically, and in fact the changes to these 
graphs and requests to change the graph are themselves meta-events that can 
be subscribed to. As in similar systems, access control policies limit who may 
add and delete nodes and arcs, and where they may be added. 

Notice that this model has some characteristics of database systems, some 
of group communication systems, and some unique characteristics of its own. 
Information spaces in MOM are analogous to tables in relational databases. 
Just as database tables have data schemas, information spaces have message 
schemas. The selection relation between information spaces is similar to the 
select operator between relational tables. Just as a relational database allows 
linkages between tables and views, MOM allows linkages between information 
spaces via message flow graphs. 

Like group communication systems, messages flow from producers to con- 
sumers without explicit requests from consumers — i.e., both systems are push- 
based. In a push-based environment, it is natural that operations on the message 
flow, such as selections, transforms, and summarizations, are performed incre- 
mentally. 

Defining and implementing the flow graph model gives rise to a number of 
open research issues. These issues are discussed in detail in the next section. 

4 Research Issues 

4.1 Model 

The message flow graph is a useful abstraction for specifying many problems 
such as the ones discussed in the previous section. It is easy to explain to users 
familiar with dataflow graphs. 

There are, however, a number of open issues. One is the type system for 
defining schemas for information spaces. Whereas it makes sense to organize 
relational databases as tables in normal form with each row consisting of a tuple 
of scalars, a similar “normal form” is probably not feasible for message histories. 
One reason is that while relational tables containing rows of different formats 
can be factored into separate tables, message histories cannot be so factored
without losing the intrinsic total order characteristic of histories. It is probably 
necessary to allow messages to contain variant types, as well as embedded lists 
or bags. 

Another open issue is the language for expressing the selection predicates, the 
transforms, and the summary functions used by expand and collapse. We want
to be able to express problems like “cheapest current offer for a Saab” with- 
out elaborate programming. There is a tradeoff between convenience, expressive 
power, and analyzability. 

Another issue is how to handle access control. In this model, access control 
can involve more than merely saying “this user may subscribe from this space”.
Some subscriptions require all messages to be archived forever, while others allow 
messages to be expired relatively quickly — these differences have consequences 
for physical resource requirements on broker servers, so there should be a way 
to allow access control to take these consequences into account. 



4.2 Scalability 

There are many dimensions of scalability. In this section, we deal with the poten- 
tially large fanout of select, transform, or summarization arcs from a single in- 
formation space. In a large application, or in an anonymous environment such as 
home shopping where a single information space may be advertised very widely, 
the number of subscribers may be very large — perhaps in the tens of thou- 
sands or more. In this environment it becomes necessary to deploy algorithms 
that quickly match events against a large number of potential subscriptions — 
we refer to this problem as the message matching problem. Though the number 
of subscriptions to an information space may be large, say N (where N may be 
10,000 or more), we expect only a few subscriptions to match any single event. 
Efficient algorithms exist for solving the message matching problem in messaging
systems based on subject-based subscription: a simple table lookup based
on the subject of the message yields a constant time algorithm, which is also 
optimal. In the more general content-based subscription systems, this approach 
does not work since different subscriptions may refer to different fields of the 
message schema. Naive solutions take O(N) steps to solve the message matching
problem. It is highly desirable to develop algorithms to solve the message
matching problem that are sublinear in N. This is an active area of ongoing 
research [ASS+99]. 
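
The contrast between the two cases can be made concrete with a short sketch. The class names and message fields are hypothetical, and the content-based matcher is deliberately the naive one: it evaluates every subscription against every message, which is the O(N) behaviour a sublinear algorithm would have to improve on.

```python
from collections import defaultdict


class SubjectMatcher:
    """Subject-based matching: one table lookup per message."""

    def __init__(self):
        self._by_subject = defaultdict(set)

    def subscribe(self, subscriber, subject):
        self._by_subject[subject].add(subscriber)

    def match(self, message):
        # Expected constant-time lookup, independent of the number of subscriptions.
        return set(self._by_subject.get(message["subject"], ()))


class NaiveContentMatcher:
    """Content-based matching: every subscription predicate is tested in turn."""

    def __init__(self):
        self._subscriptions = []

    def subscribe(self, subscriber, predicate):
        self._subscriptions.append((subscriber, predicate))

    def match(self, message):
        # O(N) in the number of subscriptions, even if only a few of them match.
        return {s for s, predicate in self._subscriptions if predicate(message)}


message = {"subject": "trade", "symbol": "XYZ", "price": 42.0, "volume": 1500}

by_subject = SubjectMatcher()
by_subject.subscribe("broker-1", "trade")
by_subject.subscribe("broker-2", "bid")

by_content = NaiveContentMatcher()
by_content.subscribe("analyst-1", lambda m: m["symbol"] == "XYZ" and m["price"] > 40)
by_content.subscribe("analyst-2", lambda m: m["volume"] > 10000)
```

The table lookup works only because every subscription names the same field; once subscriptions may test different fields of the schema, no single key indexes them all.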

Note that the message matching problem is complementary to the query 
problem in databases. In a database query, a single query (typically a select) is 
matched against a large amount of data — the challenge here is to develop algo- 
rithms that are sublinear in the amount of data in the database. In the message 
matching problem a single piece of data (a message) needs to be matched against 
a large number of standing queries (subscriptions). Note that the matching prob- 
lem was first studied in the context of active databases. Efficient solutions to this 
problem are thus applicable in MOMs and in databases [HCKW90]. 
A problem similar to the message matching problem arises in the context
of message flow graphs when there is a large fanout of arcs between a single 
information space and other information spaces representing states. By analyzing 
the multiple summarization functions we may be able to avoid the need to make 
multiple copies of the same state update and to exploit the fact that if one state 
doesn’t change as a result of a message, a set of related states will also not 
change. 



4.3 Distributed Implementation 

The above solutions for matching have to be modified to reflect the fact that 
the message flow graph will typically be implemented over a geographically dis- 
tributed network of server processes, which we call message brokers. Message 
brokers must combine the functions of routing and multicasting with the func- 
tions of implementing selections, transformations, and summarizations. Thus it 
becomes necessary to develop distributed solutions to these problems — this is 
an active area of ongoing research [BCM+99]. 

Consider two naive solutions to this problem: 

— Perform message matching at the publisher and use the result of the match- 
ing to route to the destinations. With this solution, straightforward routing 
techniques will not work when there are a large number of clients, since (a) 
point-to-point routing will not take advantage of common paths, (b) routing 
based on destination lists could result in large message headers, and (c) rout- 
ing based on multicast groups could require a very large number of groups 
(if there are N subscribers, the system may need as many as 2^N multicast
groups). 

— Broadcast the message to all message brokers and let each message broker 
match the message against its locally connected subscribers. This solution 
is likely to waste communication bandwidth in very large networks, since if 
subscriptions are sufficiently selective, messages will often be sent to a broker 
none of whose attached clients requires the message. 

Approaches to the problem being studied include: (1) performing partial 
matching at each broker, forwarding messages (either by conventional point- 
to-point or by multicast) to the subset of neighboring brokers requiring the 
message; (2) matching messages to a combination of pre-allocated multicast 
groups; (3) exploiting the relationships between the subscriptions to reduce the 
combinatorial possibilities of multicast groups. 
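
Approach (1) can be sketched as follows, assuming an acyclic (tree-shaped) broker network in which each broker already knows, for every neighbor, the subscription predicates reachable through that neighbor. The Broker class, its methods, and the way this knowledge is installed by hand are hypothetical simplifications; how it would be propagated and kept current is not addressed here.

```python
class Broker:
    """A message broker in an acyclic (tree-shaped) broker network."""

    def __init__(self, name):
        self.name = name
        self.local = []    # (client, predicate) pairs attached to this broker
        self.links = []    # (neighbor, predicates reachable through that neighbor)
        self.received = 0  # number of messages this broker had to handle

    def attach(self, client, predicate):
        self.local.append((client, predicate))

    def link(self, neighbor, reachable_predicates):
        self.links.append((neighbor, list(reachable_predicates)))

    def route(self, message, came_from=None):
        """Deliver locally, then forward only toward brokers that need the message."""
        self.received += 1
        delivered = [(self.name, client) for client, pred in self.local if pred(message)]
        for neighbor, predicates in self.links:
            if neighbor is came_from:
                continue  # never send a message back where it came from
            if any(pred(message) for pred in predicates):
                delivered += neighbor.route(message, came_from=self)
        return delivered


def wants_abc(m):
    return m["symbol"] == "ABC"


def wants_expensive(m):
    return m["price"] > 40


# A line of three brokers: A - B - C. Messages are published at A.
a, b, c = Broker("A"), Broker("B"), Broker("C")
b.attach("client-1", wants_abc)
c.attach("client-2", wants_expensive)
a.link(b, [wants_abc, wants_expensive])  # everything reachable through B
b.link(a, [])
b.link(c, [wants_expensive])
c.link(b, [wants_abc])

hit = a.route({"symbol": "XYZ", "price": 50})   # travels A -> B -> C
miss = a.route({"symbol": "XYZ", "price": 10})  # stops at A
```

Unlike the broadcast solution, the second message never reaches brokers B and C, because no subscription behind them matches it.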

The existence of message transformations complicates the situation even fur- 
ther — some transformations can be moved and/or replicated on multiple bro- 
kers; others cannot, either because they involve data that cannot be moved (e.g. 
a large database mapping names to social security numbers), or because they 
involve “opaque” algorithms not visible to the middleware. 
4.4 Message Reliability

The fault model that is typically implemented in traditional group communi- 
cation systems — that a failed or slow process is automatically removed from 
the group [BT87] — is inappropriate for MOM-based applications. In MOM, 
the message flow graph is viewed as an abstract reliable entity: Subscriptions 
are persistent, and messages may not be lost, permuted, or duplicated, nor may
spurious messages be generated (unless such distortions can be masked as a
result of filtering or state equivalence). The implementation must preserve the 
appearances of persistence even though in the message distribution scenarios 
shown above, the distributed system may contain intermittently connected and 
intermittently faulty hardware components. This means that when a faulty sub- 
scriber reconnects, it must be possible to either deliver all the messages that it 
has missed, or else to compute (via analysis of the effects on the state of inter- 
est) a shorter set of messages which will re-create this state. Unlike with group 
communication systems, it is not sufficient to report to a faulty or disconnected 
subscriber that its subscription has been dropped. For example, the Replenish- 
ment Analysis teams must continue to receive inventory updates after a dropped 
connection is reestablished. We need algorithms to provide this appearance of 
persistence in a distributed network of message brokers. We also need algorithms 
to exploit the cases where state equivalence permits dropping of messages, and 
to exploit the properties of state equivalence to deliver compressed message se- 
quences after a reconnection. 
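
As a rough illustration of a compressed message sequence, the sketch below assumes that the subscriber's state of interest is only the latest quantity per item, so that a later update to an item makes earlier ones to the same item redundant. The tuple layout and the function name `compress_missed` are hypothetical.

```python
from typing import Dict, List, Tuple

Update = Tuple[int, str, int]  # (sequence number, item, quantity); assumed layout


def compress_missed(log: List[Update], last_seen: int) -> List[Update]:
    """Return a shorter replay that re-creates the 'latest quantity per item' state.

    Only updates with a sequence number above last_seen are considered, and
    for each item only the update with the highest sequence number is kept.
    """
    latest: Dict[str, Update] = {}
    for upd in log:
        seq, item, _ = upd
        if seq > last_seen and (item not in latest or seq > latest[item][0]):
            latest[item] = upd
    # Tuples sort by their first field, so the replay is in sequence order.
    return sorted(latest.values())


log = [(1, "widget", 5), (2, "gadget", 7), (3, "widget", 2), (4, "widget", 9)]
replay = compress_missed(log, last_seen=1)
```

Here a subscriber that disconnected after message 1 receives two messages instead of three. Under a different state of interest (for example, a running total) the same compression would not be legitimate.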

4.5 Message Ordering 

Information spaces support the abstraction of a total order on messages. Since 
subscribers specify their interest in states derived from message histories, the 
middleware has the opportunity of relaxing total order deliveries for specific 
clients while preserving the meaning of the overall message history. This is in 
contrast with the approach taken by group communication systems in which 
ordering guarantees are driven by low level protocol options (e.g., publisher 
ordering, causal ordering, etc.) [BCJ+90]. 

For instance, if a subscriber is subscribing to the current price of a set of 
advertised items, the subscriber may be sensitive to the order of the last two 
updates to the same item, since the current price depends upon which update 
is first. The subscriber may not be sensitive to the order of earlier updates to 
that item or to the order of updates to different items. This gives the system 
the flexibility to weaken the ordering requirement where it is legitimate to do so 
while preserving it in the cases where it matters. However, it gives rise to open 
issues of how these situations are detected. It also creates the opportunity for 
optimistic delivery of out-of-order messages, as discussed below. 

4.6 Optimistic Delivery 

Efficient message delivery implementations that address fault-tolerance and or- 
dering make a distinction between the receipt of a message and its actual delivery 
to a client: it is often necessary for the system to delay the delivery of a 
received message until certain control messages have arrived, such as, for example, 
notifications that the data is stable and that no earlier messages are still en 
route. It is desirable wherever possible to deliver messages optimistically with- 
out waiting for this control information. In the simplest cases, the subscriber’s 
state of interest doesn’t depend upon order or isn’t affected by extraneous un- 
logged messages. However, in more interesting cases, the state of interest does 
depend upon order, but the state interpretation makes it possible for “recovery” 
messages to retrieve the correct state after an out-of-order or unlogged message 
has arrived. For example, in the home shopping example, it may be that an offer 
to sell for $20000 is followed by an offer to sell for $30000. If the offer for $30000 
arrives first, it can still be immediately delivered; when the offer for $20000 ar- 
rives later, the recovery action is to deliver it if it is for a different item than the 
$30000 offer, and to discard it as obsolete if it is for the same item. 
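
The recovery rule in this example can be sketched as follows, assuming that every offer carries a sequence number reflecting the intended total order. The class and field names are hypothetical, and the sketch covers only this one state interpretation (the newest offer per item).

```python
from typing import Dict, List, Tuple

Offer = Tuple[int, str, int]  # (sequence number, item, price); assumed layout


class OptimisticSubscriber:
    """Delivers offers on arrival and discards late arrivals that are obsolete."""

    def __init__(self) -> None:
        self.newest: Dict[str, int] = {}   # item -> highest sequence number delivered
        self.delivered: List[Offer] = []

    def receive(self, offer: Offer) -> bool:
        """Deliver the offer unless a newer (or identical) one for the item was delivered."""
        seq, item, _ = offer
        if item in self.newest and self.newest[item] >= seq:
            return False  # obsolete or duplicate: recovery action is to discard
        self.newest[item] = seq
        self.delivered.append(offer)
        return True


same_item = OptimisticSubscriber()
first = same_item.receive((2, "car", 30000))    # arrives first, delivered at once
late = same_item.receive((1, "car", 20000))     # older offer for the same item

other_item = OptimisticSubscriber()
other_item.receive((2, "car", 30000))
late_other = other_item.receive((1, "boat", 20000))  # different item
```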

It is an open problem to analyze a set of subscriptions to state derived from 
message histories, and determine (a) under which conditions messages can be 
optimistically delivered without waiting for control messages, and (b) what “re- 
covery” messages must be inserted if it is later determined that the state needs 
to be corrected. 



4.7 Topology Changes 

End-users don’t care about the topology of the underlying network. Ideally (a) 
it should be possible to reconfigure the topology of the underlying network non- 
disruptively, and (b) it should not require complex planning on the part of a 
network administrator to configure. Any approach to the topology reconfigura- 
tion problem must address scenarios in which multiple organizations may own 
parts of the communications links and logging disks, and these organizations 
must be able to reconfigure and/or control use of their facilities. 



4.8 Security 

MOM needs at least three varieties of security: (1) control of who may publish 
to, or subscribe from the information spaces of the virtual message flow graph, 
(2) control of the physical resources, (3) privacy protection of the data that flows 
between publishers and subscribers. Any security solution must accept the fact 
that there is no single “application” and no single owner of the whole network. 

There are open issues about: (1) preventing a user from overloading system 
resources by either generating messages too quickly or by requesting states that 
make it impossible to discard any old messages; (2) how to deal with clients to 
the same information space from different organizations having different access 
rights; and (3) the tension between the requirement for brokers to do content- 
based matching and the requirement for some brokers not to be able to interpret 
the content. 

5 Alternative Approaches 

Other technologies, including object request brokers (ORBs) and database man- 
agement systems (DBMSs) are being used to glue applications together in the 
kinds of scenarios presented in this paper [Gro98,Hsi92,WfM,BN83]. However, 
each of these approaches has its limitations for the purpose of MOM applications, 
as described below. 



5.1 Remote Method Invocation (RMI) Systems 

Object request brokers (e.g., [Gro98]) and in general RPC systems [BN83] can 
be used to glue applications by having one application call methods of objects 
in another remote application. The interfaces supported by an application are 
specified in an interface definition language (IDL) and are compiled into stubs 
for the caller and into language templates for the callee. RMI systems have 
several shortcomings that make them unsuitable for MOM applications: 

— Application evolution: With this approach, applications tend to get tightly 
integrated, right from the design stage. Changes are difficult if not impos- 
sible to make after an application has been deployed. Also, since remote 
method invocation is a point-to-point concept, it is not possible to interpose 
new applications between existing information flows without disrupting the 
existing applications. 

— Disconnected operation: RMI systems support a synchronous style of inter- 
action — this makes them unsuitable in environments where clients may 
disconnect. 



5.2 Database Systems 

In general, database systems are optimized for a different set of applications than 
the ones that are presented in this paper. For example, databases are optimized 
for queries over a large amount of saved data as opposed to matching a message 
against a large number of standing queries or computing a summary state from a 
sequence of messages. Also, database systems usually support a small number of 
views whereas MOM systems must support a large number of views and must be 
optimized for frequent view updates. Furthermore, the overhead of distributed 
transactions in databases is prohibitively large for MOM applications. 

The database community has developed a variety of techniques to use shared 
databases to glue together applications in a distributed environment. With this 
approach, one application adds data to a shared database and another applica- 
tion retrieves the data from it. Shared databases can be used in several config- 
urations for this purpose, but all of them have their limitations: 

— Pull-based: The receiving application may poll the shared database for new 
“incoming” data; this unnecessarily introduces extra network traffic. 

— Active Databases: The receiving application may be alerted about new data 
in the database using a trigger mechanism. However, the trigger mechanism 
may not scale over a large number of receivers interested in different kinds 
of updates to the shared database. 

— Client-server architecture: In this architecture, distributed clients access a 
centralized database. This approach offers limited scalability, and does not 
have the ability to cross organizational boundaries. Changes must first prop- 
agate to the centralized database before being sent to interested viewers. 

— Distributed database architecture: In this architecture, the database is repli- 
cated at multiple sites. In many scenarios for glueing applications together, 
the replication guarantees provided by distributed databases may be too 
strong. 

— Federated databases: In this architecture [KK90,Hsi92,GL94], a collection of 
independently designed databases is made to function as a single database. 
This involves name conflict resolution, schema conversion, and the execution 
of transactions on multiple databases as a single global transaction. Although 
this approach may be appropriate for organizing multiple databases within 
a single company, or for merging two companies together, it is unlikely to be 
feasible to run global serial transactions across multiple organizations and 
thousands of anonymous subscribers worldwide. 

In general, a database used as a communication channel is too heavyweight 
as glue between applications. This approach has a significant administrative 
overhead and may not provide the same kind of communication throughput 
that a more specialized communication channel could provide. Thus, although 
databases cannot be a total solution, especially given heavyweight commit pro- 
tocols that we often don’t need in the message flow graph solutions, databases 
are still potential clients of MOM systems. 



5.3 Group Communication Systems 

Group communication systems (e.g., [Pow96]) can be used to glue applications 
by having applications join process groups meant for exchanging particular 
types of messages. This technique is commonly used to implement subject-based 
pub/sub, where a subject (or a channel) is implemented as a process group. 
In fact, we view MOM as a natural evolution of the group communication ap- 
proach. However, this approach has several shortcomings if used to support the 
full spectrum of MOM applications: 

— Flexibility: The group communication based approach imposes a fixed sub- 
ject structure on all applications — this reduces the flexibility of the overall 
system. In large systems, it may be necessary for different applications to 
select messages based on different fields in a message — a subject structure 
cannot capture this requirement. 

— Scalability: Group communication system implementations tend to be tightly 
coupled, thus it is natural to deploy group communication systems over 
small numbers of computers (100s) on a tightly coupled network (e.g., a 
LAN). Scaling group communication to larger numbers of computers and 
onto WAN environments is an open area of active research [BFH97]. 

— Fault model: The fault model that is typically implemented in group com- 
munication systems — that a failed process is automatically removed from 
the group — is inappropriate for MOM applications. In MOM, subscriptions 
are persistent: when a failed process recovers, it needs to be updated with all 
the messages that it did not receive. 

— Opaque messages: In general, group communication systems do not interpret 
the content of messages. This forces them to support qualities of service based 
on low level properties of the protocol, not on the semantics of messages. 
MOM systems can get more information from applications and use it to 
provide various qualities of service more effectively. 



5.4 Workflow Systems 

Workflow systems (e.g., [WfM,Lut94]) are used for coordinating potentially dis- 
tributed tasks via a specification of the sequence and control of tasks. While 
these systems are typically used to solve problems at a “higher-level”, they may 
also be used to glue applications by treating each application as an “activity” 
that communicates with other “activities” via a workflow manager (which is the 
software component that controls the flow of work between activities). However, 
the major shortcomings of this approach when used for integrating applications 
are: 



— Workflow specifications are relatively static in nature — activities and their 
interactions do not change much once the flow has been defined. Applications 
requiring integration, on the other hand, may need to support changes to 
subscriptions at a frequent rate. 

— Workflow managers are centralized in practice. This may limit the scalability 
and the throughput of the system. Building distributed workflow systems is 
an active area of ongoing research [PPHC98]. 

6 Related Work in Content-based Publish-Subscribe Systems 

Several projects [WIS], such as SIENA [Car98], READY [GKP99], Elvin [SA97], 
and Gryphon (http://www.research.ibm.com/gryphon), are exploring the use of 
content-based publish-subscribe systems as the basis for various MOM appli- 
cations. While the motivations for the work in these projects are similar, the 
approaches they are pursuing are different in important respects. 

From a model point of view, all of the above systems support rich 
subscription languages which approach the expressiveness of SQL. Some of the 
systems, e.g., READY, also support the use of temporal relationships between 
messages for expressing “compound events”. As described earlier, the content-based 
publish-subscribe model can be generalized to include transformations, merges, 
and stateful operations in the single framework of message flow graphs - this 
approach is being explored by the Gryphon project. None of the above projects 
has explored the notion of cross-domain message flows in any detail. 

From an implementation point of view, none of the above projects has ad- 
dressed all the research issues mentioned earlier in this paper. The first 
scalable solutions for matching and multicasting have appeared 
recently [ASS+99,BCM+99]. The SIENA project has explored issues relating to 
efficient propagation of subscriptions. Questions of efficient matching and multi- 
cast, message reliability, message ordering, and optimistic delivery are currently 
being explored in the Gryphon project. 

Abstract. Uniform Reliable Broadcast (URB) is a communication primitive 
that requires that if a process delivers a message, then all correct 
processes also deliver this message. A recent paper [HR99] uses Knowl- 
edge Theory to determine what failure detectors are necessary to imple- 
ment this primitive in asynchronous systems with process crashes and 
lossy links that are fair. In this paper, we revisit this problem using a 
different approach, and provide a result that is simpler, more intuitive, 
and, in a precise sense, more general. 



1 Introduction 

Uniform Reliable Broadcast (URB) is a communication primitive that requires 
that if a process delivers a message, then all correct processes also deliver this 
message [HT94]. A recent paper [HR99] uses Knowledge Theory to determine 
what failure detectors are necessary to implement this primitive in asynchronous 
systems with process crashes and fair links (roughly speaking, a fair link may 
lose an infinite number of messages, but if a message is sent infinitely often 
then it is eventually received).^1 In this paper, we revisit this problem using an 
algorithmic-reduction approach [CHT96], and provide a result that is simpler, 
more intuitive, and, in a precise sense, more general, as we now explain. 

[HR99] considered systems where up to f processes may crash and links are 
fair, and used Knowledge Theory to show that solving URB in such a system 
requires a generalized f-useful failure detector (denoted $\mathcal{G}^f$ here). Such a 
failure detector is parameterized by f and is described in Fig. 1. [HR99] shows 
that when $f = n$ or $f = n - 1$, $\mathcal{G}^f$ is equivalent to a perfect failure detector. 
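
Fig. 1 calls a failure detector event reporting (S, k) f-useful when (a) S contains every process that crashes in the run and (b) n − |S| > min(f, n − 1) − k. A minimal sketch of that test, assuming processes are identified by integers and the set of processes that crash in the run is known to the checker (the function name is hypothetical):

```python
from typing import Set


def is_f_useful(S: Set[int], k: int, crashed: Set[int], n: int, f: int) -> bool:
    """Check conditions (a) and (b) of Fig. 1 for one event reporting (S, k)."""
    contains_all_crashed = crashed <= S                  # condition (a)
    enough_outside = n - len(S) > min(f, n - 1) - k      # condition (b)
    return contains_all_crashed and enough_outside


# Example with n = 5 processes, f = 2, and process 1 crashing in the run.
tight = is_f_useful({1}, 1, crashed={1}, n=5, f=2)
misses_crash = is_f_useful({2}, 1, crashed={1}, n=5, f=2)
too_large = is_f_useful({1, 2, 3, 4}, 1, crashed={1}, n=5, f=2)
```

The sketch only evaluates a single event against a known failure pattern; it does not model the two run-level properties that Fig. 1 additionally requires.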

In this paper, we revisit this problem using the approach in [CHT96], and 
give a simpler characterization of the failure detectors that can solve URB in 
systems with process crashes and fair links. More precisely, we prove that the 

* Research partially supported by NSF grant CCR-9711403 and by an Olin Fellowship. 
^1 [HR99] actually studies a problem called Uniform Distributed Coordination. This 
problem, however, is isomorphic to URB: init and do in Uniform Distributed Coor- 
dination correspond to broadcast and deliver in URB, respectively. 

A generalized failure detector [HR99] outputs a pair (S, k) where S is a subset of 
processes and k is a positive integer. Intuitively, the failure detector outputs (S, k) 
to report that k processes in S are faulty. In a run r, the failure detector event 
$\mathit{suspect}_p(S, k)$ is said to be f-useful for r if (a) S contains all processes that 
crash in r, and (b) $n - |S| > \min(f, n - 1) - k$. A generalized failure detector is 
f-useful if, for all runs r and processes p, the following two properties hold (where 
$r_p(t)$ denotes the prefix of run r at process p up to time t): 

— If $\mathit{suspect}_p(S, k)$ is in $r_p(t)$ then there is a subset $S' \subseteq S$ such that 
$|S'| = k$ and for all $q \in S'$, we have that $\mathit{crash}_q$ is in $r_q(t)$. 

— If p is correct, then there is an f-useful failure-detector event for r in $r_p(t)$, 
for some t. 

Fig. 1. Definition of generalized f-useful failure detectors. 



weakest failure detector for this problem is a simple failure detector denoted $\Theta$. 
$\Theta$ outputs a set of processes that are currently trusted to be up,^2 such that: 

Completeness: There is a time after which processes do not trust any process 
that crashes. 

Accuracy: If some process never crashes then, at every time, every process trusts 
at least one process that never crashes. 
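
The two properties above quantify over all times, so they cannot be verified on a finite trace. The sketch below therefore checks them only up to a finite horizon, as an illustration. It assumes a history stored as one list of trusted sets per process, and it reads completeness as a requirement on the processes that do not crash; all names are hypothetical.

```python
from typing import Dict, List, Optional, Set

History = Dict[int, List[Set[int]]]   # process -> trusted set at times 0 .. horizon-1


def accuracy_holds(history: History, crashed: Set[int], processes: Set[int]) -> bool:
    """Every recorded output contains at least one process that never crashes."""
    never_crash = processes - crashed
    if not never_crash:
        return True  # the property is conditional on some process never crashing
    return all(trusted & never_crash
               for outputs in history.values() for trusted in outputs)


def completeness_time(history: History, crashed: Set[int],
                      processes: Set[int], horizon: int) -> Optional[int]:
    """Earliest time from which, up to the horizon, no correct process trusts a
    crashed one; None if even the last recorded outputs still do."""
    correct = processes - crashed
    t = horizon
    while t > 0 and all(not (history[p][t - 1] & crashed) for p in correct):
        t -= 1
    return t if t < horizon else None


processes = {1, 2, 3}
crashed = {3}
history: History = {
    1: [{1, 2, 3}, {1, 2, 3}, {1, 2}, {1, 2}],
    2: [{2, 3}, {2}, {1, 2}, {2}],
    3: [{1, 3}, {1, 3}, {1, 3}, {1, 3}],
}
```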

This simple characterization of the weakest failure detector for URB is more 
general than the one given in [HR99], in the sense that it holds for any system 
with fair links, regardless of f or any other types of restrictions or dependencies 
on process crashes.^3 To illustrate this point, consider the following three systems 
with n processors $\{p_1, p_2, \ldots, p_n\}$: 

1. In system $S_1$, every processor may crash, except that we assume that $p_1$ 
and $p_2$ cannot both crash in the same run (this assumption makes sense if, 
for example, $p_1$ and $p_2$ are configured as symmetric primary/backup servers). 
Note that in $S_1$, up to $f = n - 1$ processors may crash in the same run. 

2. In system $S_2$, every processor may crash, except that processor $p_1$ is a 
fault-tolerant, highly available computing server that crashes only when it is left 
alone in the system (this assumption is not unreasonable: in some existing 
systems, processes kill themselves if they are unable to communicate with 
a minimum number of processes). Note that in $S_2$, up to $f = n$ processors 
may crash in the same run. 

^2 Some failure detectors in the literature output a set of processes suspected to be 
down; this is just the complement of the set of processes that are trusted to be up. 
^3 If one assumes that a majority of processes does not crash, then URB can be solved 
without any failure detector [BCBT96]. As we explain in Sect. 11, this does not 
contradict our result. 

3. In system $S_3$, the number of processes that crash is bounded, but this 
bound f is not given. Moreover, there are some additional restrictions and 
dependencies on process crashes (e.g., if more than half of the processes crash 
then a certain process $p_i$ commits suicide) but these are also not given. 

What is the weakest failure detector for solving URB in each of $S_1$, $S_2$ and $S_3$? 
By our result, the answer is simply $\Theta$. 

In contrast, the result in [HR99] cannot be applied to $S_1$, $S_2$ and $S_3$, as we 
now explain. For $S_3$, this is obvious because f is not given. For $S_1$, the value 
of f, namely $n - 1$, is given, but the result in [HR99] cannot be used, because 
it assumes that every set of (up to) $n - 1$ processes may crash in a run, an 
assumption that does not hold for $S_1$. Similarly, for $S_2$, one cannot apply the 
result in [HR99]. 

Since, in some sense, both $\mathcal{G}^f$ and $\Theta$ are “minimal” for URB, an important 
question is now in order: What is the relation between $\mathcal{G}^f$ and $\Theta$? To answer 
this question, we use the notions of failure patterns and environments [CHT96]. 
Roughly speaking, a failure pattern indicates, for each process p, whether p 
crashes and, if so, when. An environment $\mathcal{E}$ is a set of failure patterns; and a 
system with environment $\mathcal{E}$ is one where the process crashes must match one of 
the failure patterns in $\mathcal{E}$. Intuitively, environments allow us to express 
restrictions on process crashes, such as “either $p_1$ or $p_2$, but not both, may crash” (so 
environments can be used to formally define the systems $S_1$ and $S_2$ described 
earlier). A commonly used environment in the literature is $\mathcal{E}^f$, the set of all 
failure patterns in which at most f processes crash: A system with environment $\mathcal{E}^f$ 
allows up to f process crashes, but there are no other constraints or 
dependencies, i.e., any subset of f processes may crash, and these crashes can occur at any 
time. 

We can now compare $\mathcal{G}^f$ and $\Theta$. Roughly speaking, $\Theta$ is the weakest failure 
detector for URB regardless of the environment $\mathcal{E}$, while $\mathcal{G}^f$ is necessary and 
sufficient for URB in environment $\mathcal{E}^f$. When $\mathcal{E} = \mathcal{E}^f$, there is an algorithm that 
transforms $\mathcal{G}^f$ into $\Theta$,^4 and so $\Theta$ is at least as weak as $\mathcal{G}^f$ in environment $\mathcal{E}^f$. 

An important difference between [HR99] and this paper is that [HR99] uses 
Knowledge Theory [FHMV95] to establish and state its results, while we use 
algorithmic reductions [CT96]. An advantage of the algorithmic reduction method 
over the knowledge approach is that the former allows the derivation of a 
stronger result: in a nutshell, the knowledge approach determines only what 
information about failures processes know, while the algorithmic reduction method 
determines what information about failures processes know and can effectively 
compute. Specifically, the result in [HR99] is that, in order to solve URB, 
processes must know the information provided by $\mathcal{G}^f$. This does not automatically 
imply that processes can actually compute $\mathcal{G}^f$.^5 

^4 This is modulo a technicality due to a difference in the two models: in [HR99] all 
the failure detector events are “seen” by processes, while here processes can “miss” 
some failure detector values. 

^5 In Knowledge Theory, processes may know facts that they cannot actually compute. 
For example, if the system is sufficiently expressive, they know the answer to every 

In contrast, the algorithmic reduction given in this paper shows that if 
processes can solve URB with some failure detector $\mathcal{D}$, then they can use $\mathcal{D}$ to 
compute failure detector $\Theta$. This reduction implies that $\mathcal{D}$ is at least as strong 
as $\Theta$ in terms of problem solving: if processes can solve a problem with $\Theta$, they 
can also solve it with $\mathcal{D}$ (by first using $\mathcal{D}$ to compute $\Theta$). Note that we would not 
be able to say that $\mathcal{D}$ is at least as strong as $\Theta$ (in terms of problem solving) 
if $\mathcal{D}$ only allowed processes to know (but not compute) the information provided 
by $\Theta$. 

Finally, there is another difference between our approach and the one 
in [HR99], namely, the universe of failure detectors that is being considered. To 
understand the meaning of a statement such as “$\mathcal{D}$ is the weakest failure 
detector...”, or “$\mathcal{D}$ is necessary...”, one needs to know the universe of failure detectors 
under consideration (because it is among these failure detectors that $\mathcal{D}$ is the 
“weakest” or “necessary”). In our paper, the universe of failure detectors is 
explicit and clear: a failure detector is a function of the failure pattern, a natural 
definition that is widely used [CT96, CHT96, ACT98, HMR97, OGS97, YNG98, 
LH94] etc. The universe of failure detectors in [HR99], however, is implicitly 
defined, and the exact nature and power of the failure detectors considered are 
not entirely clear. This issue is further discussed in Sect. 8. 

In summary, in this paper we consider the problem of determining the weakest 
failure detector for solving URB in systems with process crashes and lossy links, 
a problem that was first investigated in [HR99]. In [HR99], this problem was 
studied using the framework of Knowledge Theory. In this paper, we tackle this 
problem using a different approach based on the standard failure detector models 
and techniques of [CHT96]. The results that we obtain are simple, intuitive and 
general. More precisely: 

1. We provide a single failure detector $\Theta$, and show that it is the weakest failure 
detector for URB, in any environment. In particular, our result holds even 
if f is not given. 

In environment $\mathcal{E}^f$, $\Theta$ is at least as weak as $\mathcal{G}^f$. 

2. $\Theta$ is simple and a natural candidate for solving URB. As a result, the 
algorithm that uses $\Theta$ to solve URB in any environment $\mathcal{E}$ is immediate. 

3. Our results are derived and can be understood from first principles (they do 
not require Knowledge Theory). 

4. Our “minimality” result is in terms of effective computation, not knowledge: 
roughly speaking, if processes can solve URB, we show how they can 
compute $\Theta$ (this implies knowledge of the information provided by $\Theta$; but the 
converse does not necessarily hold). 

5. The universe of failure detectors (with respect to which our minimality result 
holds) is given explicitly through a simple definition. 

The paper is organized as follows. Our model is described in Sect. 2. In Sect. 3, 
we explain what it means for a failure detector to be weaker than another one. 

unsolved problem in Number Theory, and they also know whether any given Turing 
Machine halts on blank tape. 

Section 4 defines the uniform reliable broadcast problem. Failure detector $\Theta$ is 
defined in Sect. 5, and in Sect. 6, we show how to use it to implement uniform 
reliable broadcast in systems with process crashes and fair links. In Sect. 7 we 
show that $\Theta$ is actually the weakest failure detector for this problem. In Sect. 8, 
we briefly discuss the nature and power of failure detectors, and in Sect. 9 we 
consider the relation between $\mathcal{G}^f$ and $\Theta$. Related work is discussed in Sect. 10 
and we conclude the paper in Sect. 11. 

2 Model 

Throughout this paper, in all our results, we consider asynchronous 
message-passing distributed systems in which there are no timing assumptions. In 
particular, we make no assumptions on the time it takes to deliver a message, or on 
relative process speeds. The system consists of a set of n processes $\Pi = \{1, 2, \ldots, n\}$ 
that are completely connected by point-to-point (bidirectional) links. The 
system can experience both process failures and link failures. Processes can fail by 
crashing, and links can fail by dropping messages. The model, based on the one 
in [CHT96], is described next. 

We assume the existence of a discrete global clock — this is merely a fictional 
device to simplify the presentation and processes do not have access to it. We 
take the range $\mathcal{T}$ of the clock’s ticks to be the set of natural numbers. 



2.1 Failure Patterns and Environments 

Processes can fail by crashing, i.e., by halting prematurely. A failure pattern F 
is a function from $\mathcal{T}$ to $2^\Pi$. Intuitively, F(t) denotes the set of processes that 
have crashed through time t. Once a process crashes, it does not “recover”, i.e., 
$\forall t : F(t) \subseteq F(t + 1)$. We define $\mathit{crashed}(F) = \bigcup_{t \in \mathcal{T}} F(t)$ and 
$\mathit{correct}(F) = \Pi \setminus \mathit{crashed}(F)$. If $p \in \mathit{crashed}(F)$ we say p crashes (or is faulty) 
in F and if $p \in \mathit{correct}(F)$ we say p is correct in F. 

An environment $\mathcal{E}$ is a set of failure patterns. As we explained in the 
introduction, environments describe the crashes that can occur in a system. 
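
These definitions can be made concrete with a small sketch. It assumes a failure pattern is encoded by the crash time of each faulty process (so that monotonicity holds by construction) and an environment by a membership predicate; this encoding and all names are illustrative choices, not part of the model.

```python
from typing import Callable, Dict, Set

CrashTimes = Dict[int, int]                 # faulty process -> time at which it crashes
Environment = Callable[[CrashTimes], bool]  # membership test for a failure pattern


def crashed_by(pattern: CrashTimes, t: int) -> Set[int]:
    """F(t): the processes that have crashed through time t."""
    return {p for p, when in pattern.items() if when <= t}


def crashed(pattern: CrashTimes) -> Set[int]:
    """Union of F(t) over all t."""
    return set(pattern)


def correct(pattern: CrashTimes, processes: Set[int]) -> Set[int]:
    return processes - crashed(pattern)


def at_most(f: int) -> Environment:
    """Environment with at most f crashes and no other constraint."""
    return lambda pattern: len(pattern) <= f


def not_both(p: int, q: int) -> Environment:
    """Environment in which p and q never both crash in the same run."""
    return lambda pattern: not (p in pattern and q in pattern)


pattern: CrashTimes = {1: 3, 4: 0}   # process 4 crashes at time 0, process 1 at time 3
```

An environment such as `not_both(1, 2)` expresses a dependency between crashes that a bound on the number of crashes alone cannot capture.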

Links can fail by dropping messages, but we assume that links are fair. 
Roughly speaking, a fair link from p to q may intermittently drop messages, and 
may do so infinitely often, but it must satisfy the following “fairness” property: 
if p repeatedly sends some message to q and q does not crash, then q eventually 
receives that message. This is made more precise in Sect. 2.3. 



2.2 Failure Detectors 

Each process has access to a local failure detector module that provides (possibly 
incorrect) information about the failure pattern that occurs in an execution. A 
failure detector history H with range $\mathcal{R}$ is a function from $\Pi \times \mathcal{T}$ to $\mathcal{R}$. H(p, t) is 
the output value of the failure detector module of process p at time t. A failure 
detector $\mathcal{D}$ is a function that maps each failure pattern F to a non-empty set 
of failure detector histories with range $\mathcal{R}_\mathcal{D}$ (where $\mathcal{R}_\mathcal{D}$ denotes the range of the 
failure detector output of $\mathcal{D}$). $\mathcal{D}(F)$ denotes the set of possible failure detector 
histories permitted by $\mathcal{D}$ for the failure pattern F. 



2.3 Runs of Algorithms 

An algorithm A is a collection of n (possibly infinite-state) deterministic au- 
tomata, one for each process in the system. Computation proceeds in atomic 
steps of A. In each step, a process may: receive a message from a process, get 
an external input, query its failure detector module, undergo a state transition, 
send a message to a neighbor, and issue an external output. 

A run of algorithm A using failure detector $\mathcal{D}$ is a tuple $R = (F, H_\mathcal{D}, I, S, T)$ 
where F is a failure pattern, $H_\mathcal{D} \in \mathcal{D}(F)$ is a history of failure detector $\mathcal{D}$ for 
failure pattern F, I is an initial configuration of A, S is an infinite sequence of 
steps of A, and T is an infinite list of increasing time values indicating when 
each step in S occurs. 

A run must satisfy some properties for every process p: If p has crashed by 
time t, i.e., $p \in F(t)$, then p does not take a step at any time $t' \geq t$; if p is 
correct, i.e., $p \in \mathit{correct}(F)$, then p takes an infinite number of steps; and if p 
takes a step at time t and queries its failure detector, then p gets $H_\mathcal{D}(p, t)$ as a 
response. 

A run must also satisfy the following “fair link properties” for every pair of 
processes p and q: 

— Fairness: If p sends a message m to q an infinite number of times and q is 
correct, then q eventually receives m from p. 

— Uniform Integrity: If q receives a message m from p then p previously sent m 
to q; and if q receives m infinitely often from p, then p sends m infinitely 
often to q. 
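
To see how a link can drop infinitely many messages and still be fair, consider the toy link sketched below, which delivers only every k-th transmission of each distinct message. This deterministic link is a hypothetical example chosen for illustration; real fair links need not behave this regularly.

```python
from collections import Counter
from typing import Hashable, Optional


class EveryKthLink:
    """Toy lossy link: delivers only every k-th transmission of each distinct message."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.attempts: Counter = Counter()   # per-message transmission count

    def send(self, m: Hashable) -> bool:
        """Transmit m once; return True if this transmission is delivered."""
        self.attempts[m] += 1
        return self.attempts[m] % self.k == 0


def retransmit(link: EveryKthLink, m: Hashable, max_attempts: int) -> Optional[int]:
    """Resend m until delivered; return the attempt that succeeded, or None."""
    for attempt in range(1, max_attempts + 1):
        if link.send(m):
            return attempt
    return None


link = EveryKthLink(k=3)
```

The count is kept per message on purpose: with a single global counter, a sender alternating between two messages could have one of them dropped forever, which would violate Fairness.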

3 Failure Detector Transformations 

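As explained in [CT96, CHT96], failure detectors can be compared via 
algorithmic transformations. We now explain what it means for an algorithm 
$T_{\mathcal{D} \to \mathcal{D}'}$ to transform a failure detector $\mathcal{D}$ into another failure detector $\mathcal{D}'$ in an 
environment $\mathcal{E}$. Algorithm $T_{\mathcal{D} \to \mathcal{D}'}$ uses $\mathcal{D}$ to maintain a variable $\mathcal{D}'_p$ at every process p. 
This variable, reflected in the local state of p, emulates the output of $\mathcal{D}'$ at p. 
Let $H_{\mathcal{D}'}$ be the history of all the $\mathcal{D}'$ variables in a run R of $T_{\mathcal{D} \to \mathcal{D}'}$, i.e., $H_{\mathcal{D}'}(p, t)$ 
is the value of $\mathcal{D}'_p$ at time t in run R. Algorithm $T_{\mathcal{D} \to \mathcal{D}'}$ transforms $\mathcal{D}$ into $\mathcal{D}'$ 
in $\mathcal{E}$ if and only if for every $F \in \mathcal{E}$ and every run $R = (F, H_\mathcal{D}, I, S, T)$ of 
$T_{\mathcal{D} \to \mathcal{D}'}$ using $\mathcal{D}$, we have $H_{\mathcal{D}'} \in \mathcal{D}'(F)$. Intuitively, since $T_{\mathcal{D} \to \mathcal{D}'}$ is able to use $\mathcal{D}$ to 
emulate $\mathcal{D}'$, $\mathcal{D}$ provides at least as much information about process failures as $\mathcal{D}'$ 
does, and we say that $\mathcal{D}'$ is weaker than $\mathcal{D}$ in $\mathcal{E}$. 

Note that, in general, $T_{\mathcal{D} \to \mathcal{D}'}$ need not emulate all the failure detector 
histories of $\mathcal{D}'$ (in environment $\mathcal{E}$); what we do require is that all the failure detector 
histories it emulates be histories of $\mathcal{D}'$ (in that environment). 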
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4 Uniform Reliable Broadcast 

Uniform Reliable Broadcast (URB) is defined in terms of two primitives:
broadcast(m) and deliver(m). We say that process p broadcasts message m if p
invokes broadcast(m). We assume that every broadcast message m includes the
following fields: the identity of its sender, denoted sender(m), and a sequence
number, denoted seq(m). These fields make every broadcast message unique.
We say that q delivers message m if q returns from the invocation of deliver(m).
Primitives broadcast and deliver satisfy the following properties [HT94]:

— Validity: If a correct process broadcasts a message m, then it eventually 
delivers m. 

— Uniform Agreement: If some process delivers a message m, then all correct 
processes eventually deliver m. 

— Uniform Integrity: For every message m, every process delivers m at most 
once, and only if m was previously broadcast by sender(m).

Validity and Uniform Agreement imply that if a correct process broadcasts a 
message m, then all correct processes eventually deliver m. 

5 Failure Detector $\Theta$

We now define failure detector $\Theta$. Each failure detector module of $\Theta$ outputs
a set of processes that are trusted to be up, i.e., $\mathcal{R}_\Theta = 2^\Pi$. For each
failure pattern $F$, $\Theta(F)$ is the set of all failure detector histories $H$ with
range $\mathcal{R}_\Theta$ that satisfy the following properties:

— [$\Theta$-completeness]: There is a time after which correct processes do not trust
any process that crashes. More precisely:

$$\exists t \in \mathcal{T},\ \forall p \in correct(F),\ \forall q \in crashed(F),\ \forall t' > t : q \notin H(p, t')$$

— [$\Theta$-accuracy]: If there is a correct process then, at every time, every process
trusts at least one correct process. More precisely:

$$crashed(F) \neq \Pi \Rightarrow \forall t \in \mathcal{T},\ \forall p \in \Pi \setminus F(t),\ \exists q \in correct(F) : q \in H(p, t)$$

Note that a process may be trusted even if it has actually crashed. Moreover,
the correct process trusted by a process p is allowed to change over time
(in fact, it can change infinitely often), and it is not necessarily the same as
the correct process trusted by another process q.
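
The two properties can be checked mechanically on a finite prefix of a history. The following Python sketch is an illustration only: the finite horizon, the dictionary encodings of the failure pattern and of the history, and the names `crash_time`, `history`, and `satisfies_theta_prefix` are assumptions made for this example and are not part of the model. On a finite prefix, completeness can only be approximated as "holds from some time up to the horizon".

```python
def crashed_by(crash_time, t):
    """F(t): the processes that have crashed by time t (crash_time maps process -> crash instant)."""
    return {p for p, tc in crash_time.items() if tc <= t}


def satisfies_theta_prefix(processes, crash_time, history, horizon):
    """Return (accuracy, completeness) for the history restricted to times 0..horizon-1.

    history[(p, t)] is the set of processes trusted by p at time t.
    """
    everyone = set(processes)
    crashed = set(crash_time)
    correct = everyone - crashed

    # Accuracy: if some process is correct, every process that has not crashed
    # by time t trusts at least one correct process at time t.
    accuracy = True
    if correct:
        for t in range(horizon):
            for p in everyone - crashed_by(crash_time, t):
                if not (history[(p, t)] & correct):
                    accuracy = False

    # Completeness (finite approximation): from some time on, no correct
    # process trusts a process that crashes.
    def clean_from(t):
        return all(not (history[(p, u)] & crashed)
                   for p in correct for u in range(t, horizon))

    completeness = any(clean_from(t) for t in range(horizon))
    return accuracy, completeness
```

For instance, a history in which every process trusts all of {p, q, r} until shortly after r crashes, and {p, q} from then on, passes both checks on the prefix.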

6 Using $\Theta$ to Implement Uniform Reliable Broadcast

The algorithm that implements URB using $\Theta$ is shown in Fig. 2. When ambiguities
may arise, a variable local to process p is subscripted by p. To broadcast
a message m, a process p first initializes $got_p[m]$ to $\{p\}$; this variable represents
the processes that p knows to have received m so far. Process p then forks task
diffuse(m). In diffuse(m), process p periodically sends m to all processes, and
checks if got[m] contains all processes that are currently trusted by p; when that
happens, p delivers m if it has not done so already. When process p receives m
from a process q, it starts task diffuse(m) if it has not done so already.

    For every process p:

        To execute broadcast(m):
            got[m] ← {p}
            fork task diffuse(m)
            return

        task diffuse(m):
            while true do
                send m to all processes
                d ← D_p        { d is the list of processes trusted to be up }
                if d ⊆ got[m] and p has not delivered m
                    then deliver(m)

        upon receive m from q do
            if task diffuse(m) has not been started yet then
                got[m] ← {p, q}
                fork task diffuse(m)
            else got[m] ← got[m] ∪ {q}

Fig. 2. Implementing Uniform Reliable Broadcast using $\mathcal{D} = \Theta$
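
The behavior of the algorithm in Fig. 2 can be explored with a small round-based simulation. The Python sketch below is illustrative and makes several simplifying assumptions that are not in the paper: rounds replace asynchronous steps, each running diffuse task executes one loop iteration per round, the output of $\Theta$ is supplied as a callable `trusted`, and the hypothetical `drop` predicate models message loss on the links.

```python
class Process:
    def __init__(self, pid, all_pids, trusted):
        self.pid = pid
        self.all_pids = list(all_pids)
        self.trusted = trusted      # callable returning the current set output by Theta
        self.got = {}               # got[m]: processes known to have received m
        self.delivered = []         # messages delivered so far, in order
        self.outbox = []            # (destination, message) pairs handed to the links

    def broadcast(self, m):
        # A diffuse(m) task is modelled as "m is a key of got".
        self.got[m] = {self.pid}

    def diffuse_once(self):
        """One loop iteration of every running diffuse(m) task."""
        for m in list(self.got):
            for q in self.all_pids:
                self.outbox.append((q, m))
            d = self.trusted()
            if d <= self.got[m] and m not in self.delivered:
                self.delivered.append(m)

    def receive(self, m, q):
        if m not in self.got:       # diffuse(m) not started yet
            self.got[m] = {self.pid, q}
        else:
            self.got[m].add(q)


def run(procs, crashed, rounds, drop):
    """Run the given number of rounds; drop(r, i) decides whether the i-th message of a process in round r is lost."""
    for r in range(rounds):
        for p in procs.values():
            if p.pid in crashed:
                continue
            p.diffuse_once()
            for i, (dest, m) in enumerate(p.outbox):
                if dest not in crashed and not drop(r, i):
                    procs[dest].receive(m, p.pid)
            p.outbox.clear()
```

With three processes a, b, c, where c has crashed and both a and b trust {a, b}, a broadcast by a is delivered by a and b within two rounds even if every message sent in odd rounds is lost.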

Theorem 1. Consider an asynchronous distributed system with process crashes
and fair links, and with environment $\mathcal{E}$. The algorithm in Fig. 2 implements
URB using $\Theta$ in $\mathcal{E}$.

The proof is straightforward and can be found in [ATD99].



7 The Weakest Failure Detector for Uniform Reliable 
Broadcast 

We now show that, in any environment, a failure detector $\mathcal{D}$ that can be used
to solve URB can be transformed to $\Theta$. More precisely, we have the following
theorem:

Theorem 2. Consider an asynchronous distributed system with process crashes
and fair links, and with environment $\mathcal{E}$. Suppose failure detector $\mathcal{D}$ can
be used to solve URB in $\mathcal{E}$. Then $\mathcal{D}$ can be transformed in $\mathcal{E}$ to the
$\Theta$ failure detector.

We now proceed to prove this theorem. Let $\mathcal{E}$ be an environment, $\mathcal{D}$ be a
failure detector that can be used to solve URB in $\mathcal{E}$, and $A_{URB}$ be the URB
algorithm that uses $\mathcal{D}$. We describe an algorithm $T_{\mathcal{D} \to \Theta}$ that
transforms $\mathcal{D}$ into $\Theta$ in $\mathcal{E}$. Intuitively, this algorithm works as follows.

Consider an arbitrary run of $T_{\mathcal{D} \to \Theta}$ using $\mathcal{D}$, with failure
pattern $F \in \mathcal{E}$ and failure detector history $H \in \mathcal{D}(F)$. Processes
periodically query their failure detector $\mathcal{D}$ and exchange information about
the values of $H$ that they see in this run. Using this information, processes
construct a directed acyclic graph (DAG) that represents a “sampling” of failure
detector values in $H$ and some temporal relationships between the values
sampled. To illustrate this, suppose that $q_0$ queries its failure detector
$\mathcal{D}$ for the $k_0$-th time and sees value $d_0$; $q_0$ then reliably broadcasts the
message $[q_0, d_0, k_0]$ (it can use $A_{URB}$ to do so). When a process $q_1$ receives
$[q_0, d_0, k_0]$, it can add vertex $[q_0, d_0, k_0]$ to its (current) version of the
DAG. When $q_1$ later queries $\mathcal{D}$ and sees the value $d_1$ (say this is its
$k_1$-th query), it adds vertex $[q_1, d_1, k_1]$ and edge
$[q_0, d_0, k_0] \to [q_1, d_1, k_1]$ to its DAG: this edge indicates that $q_0$ saw $d_0$
(in its $k_0$-th query) before $q_1$ saw $d_1$ (in its $k_1$-th query). By periodically
sending its current version of the DAG to all processes, and incorporating all
the DAGs that it receives into its own DAG, a process can construct an ever
increasing DAG that includes the failure detector values seen by processes and
some of their temporal relationships.

It turns out that a process p can use its DAG to simulate runs of $A_{URB}$ with
failure pattern $F$ and failure detector history $H$. These are runs that could have
occurred if processes were running $A_{URB}$ instead of $T_{\mathcal{D} \to \Theta}$.

To illustrate this, let p be a process, and consider a path in its DAG, say
$[q_0, d_0, k_0], [q_1, d_1, k_1], \ldots, [q_\ell, d_\ell, k_\ell]$. In $T_{\mathcal{D} \to \Theta}$,
process p uses this path to simulate a run $R_{sim}$ of $A_{URB}$. In $R_{sim}$, $q_0$ takes
the 0-th step, $q_1$ takes the 1-st step, $q_2$ takes the 2-nd step, and so on. In the
0-th step, $q_0$ broadcasts $m_0$. Moreover, for every j, in the j-th step process $q_j$
sees failure detector value $d_j$ and receives the oldest message sent to it that it
has not yet received (if there are no such messages, it receives nothing). It turns
out that, if failure pattern $F$ has some correct process, then process p can extract
from $R_{sim}$ a list of processes that contains at least one such correct process.
To see how, consider the step of $R_{sim}$ when a process first delivers $m_0$, and
suppose this is the k-th step. Then, among processes $\{q_0, \ldots, q_k\}$ (those that
took steps before the delivery of $m_0$), there is at least one that never crashes
in $F$. If that were not the case, we could construct another run $R_{bad}$ of $A_{URB}$
with failure pattern $F$ and failure detector history $H$, where (1) up to the k-th
step, processes behave as in $R_{sim}$, (2) after the k-th step, processes
$\{q_0, \ldots, q_k\}$ all crash, and all messages sent by these processes to other
processes are lost, and (3) from the (k + 1)-st step onwards, the correct processes
(in $F$) take steps in a round-robin fashion. Note that in $R_{bad}$, (1) process $q_k$
delivers $m_0$ at the k-th step, (2) correct processes (in $F$) only take steps after
the k-th step, (3) these processes never receive a message sent by the time of
the k-th step, and so (4) correct processes (in $F$) never deliver $m_0$, which is a
contradiction. Thus, the list $\{q_0, \ldots, q_k\}$ contains at least one correct process
(in $F$), and so p can achieve the $\Theta$-accuracy property by outputting this list.

The list $\{q_0, \ldots, q_k\}$ that p generates, however, may contain processes that
crash (in $F$). Thus, to achieve $\Theta$-completeness, p must continuously repeat the
simulation above to generate new $\{q_0, \ldots, q_k\}$ lists, such that eventually the
lists contain only correct processes (in $F$). In order to guarantee that, p must
ensure that the path $[q_0, d_0, k_0], [q_1, d_1, k_1], \ldots, [q_\ell, d_\ell, k_\ell]$ that it
uses to extract $\{q_0, \ldots, q_k\}$ eventually includes only vertices of processes that
do not crash. That will be true if all the processes that crash in $F$ do so before
$q_0$ obtains $d_0$ at its $k_0$-th step. Therefore, process p can achieve
$\Theta$-completeness (as well as $\Theta$-accuracy) by simply periodically reselecting a
new path $[q_0, d_0, k_0], [q_1, d_1, k_1], \ldots, [q_\ell, d_\ell, k_\ell]$ so that
$[q_0, d_0, k_0]$ is a “recent” vertex in its DAG.

Having given an overall account of how the transformation works,
we now explain it in more detail. In what follows, let $S$ be a sequence of
pairs consisting of a process name and a failure detector value, that is,
$S := ([q_0, d_0], [q_1, d_1], \ldots, [q_k, d_k])$. Let $m_0$ be an arbitrary fixed message.
Given $S$, we can simulate an execution of $A_{URB}$ in which (1) process $q_0$ initially
invokes broadcast($m_0$) and (2) for $j = 0, \ldots, k$, the j-th step of $A_{URB}$ is taken
by process $q_j$; in that step, $q_j$ obtains $d_j$ from its local failure detector module,
and receives the oldest message addressed to it that it has not yet received (if
there are no such messages, it receives nothing). We define Delivered($S$) to be
true if process $q_k$ delivers $m_0$ in the k-th step of this simulation.

The detailed algorithm that transforms $\mathcal{D}$ to $\Theta$ is given in Fig. 3. As we
explained above, each process p maintains a directed acyclic graph $DAG_p$, whose
nodes are triples $[q, d, seq]$. The transformation algorithm has three tasks. In the
first task, a process p periodically queries its local failure detector, creates a new
node $[p, d, curr]$ in $DAG_p$, and adds an edge from all other nodes in $DAG_p$ to
this new node. Then, p uses $A_{URB}$ to broadcast its new $DAG_p$ to all processes.
In the second task, upon the delivery of $DAG_q$ from a process q, process p
merges its own $DAG_p$ with $DAG_q$. In the third task, process p loops forever. In
the loop, p first waits until its Task 1 adds a new node to $DAG_p$, and then waits
until there is a path starting at this new node that satisfies Delivered. Once p
finds such a path, it sets the output of $\mathcal{D}'$ to the set of all processes that appear
in the path. Then, process p restarts the loop.
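
The DAG bookkeeping described above can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm of Fig. 3: the dictionary representation of the DAG and the function names are assumptions, and the predicate `delivered` is a hypothetical stand-in for Delivered, which in the paper requires simulating $A_{URB}$ on the sequence of pairs. Failure detector values are assumed to be hashable (for example, frozensets).

```python
def new_dag():
    return {"nodes": set(), "edges": set()}


def add_sample(dag, p, d, curr):
    """Task 1: add node [p, d, curr] and an edge from every other node to it."""
    node = (p, d, curr)
    dag["edges"] |= {(u, node) for u in dag["nodes"]}
    dag["nodes"].add(node)
    return node


def merge(dag, other):
    """Task 2: incorporate a received DAG into the local one."""
    dag["nodes"] |= other["nodes"]
    dag["edges"] |= other["edges"]


def find_path(dag, start, delivered):
    """Task 3: search for a path from `start` whose (process, value) sequence satisfies `delivered`."""
    successors = {}
    for u, v in dag["edges"]:
        successors.setdefault(u, []).append(v)
    stack = [[start]]
    while stack:
        path = stack.pop()
        if delivered([(q, d) for q, d, _ in path]):
            return path
        for v in successors.get(path[-1], []):
            stack.append(path + [v])
    return None  # no suitable path yet; the caller waits and retries


def trusted_from(path):
    """The set of processes that appear in the path."""
    return {q for q, _, _ in path}
```

The exhaustive path search is exponential in the worst case; it is used here only because it mirrors the "wait until the DAG contains a path" condition directly.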

The detailed correctness proof of this algorithm is given in [ATD99].

    For every process p:

        Initialization:
            DAG ← ∅
            curr ← −1
            D'_p ← Π        { trust all processes }

        cobegin
            || Task 1:
                while true do
                    d ← D_p
                    curr ← curr + 1
                    add to DAG the node [p, d, curr] and edges from all other nodes to [p, d, curr]
                    broadcast(DAG)        { use URB algorithm to broadcast }

            || Task 2:
                upon deliver(DAG_q) from q do
                    DAG ← DAG ∪ DAG_q

            || Task 3:
                while true do
                    next ← curr + 1
                    wait until DAG contains a node of the form [p, *, next]
                    wait until DAG contains a path P = ([q_0, d_0, seq_0], ..., [q_k, d_k, seq_k]) such that
                        (1) q_0 = p and seq_0 = next and
                        (2) Delivered([q_0, d_0], ..., [q_k, d_k]) is true
                    D'_p ← {q_0, ..., q_k}        { all processes in this path }
        coend

Fig. 3. Transformation of $\mathcal{D}$ to $\mathcal{D}' = \Theta$

8 On the Nature and Power of Failure Detectors

As we mentioned in the introduction, to understand the meaning of a statement
such as “$\mathcal{D}$ is the weakest failure detector...”, or “$\mathcal{D}$ is necessary...”, we
need to know the universe of failure detectors under consideration. For such
minimality results to be significant, the universe of failure detectors should be
reasonable. In particular, it should not include failure detectors that provide
information that has nothing to do with failures, e.g., hints on which messages
have been broadcast, information about the internal state of processes, etc. To
see why, suppose that a “failure detector” is allowed to indicate whether a
message m was broadcast; then processes could use it to solve the URB problem
without ever sending any messages! Similarly, with the Consensus problem, if a
“failure detector” could peek at the initial value of a process and provide this
value to all processes, processes could use it to solve Consensus without messages
and without $\Diamond\mathcal{W}$ [CHT96]. Thus, a failure detector should be defined as an
oracle that provides information about failures only.

In [HR99], it is not clear what information failure detectors are allowed to
provide: On one hand, the formal model defines failure detectors as generic
oracles;* on the other hand, their behavior is implicitly restricted by a closure
axiom (on the set of runs of the system) that is introduced later in the paper.†
The difficulty is that this axiom is technical and quite complex; furthermore, it
does not mention failure detectors and it captures other assumptions that are not
related to failure detection (e.g., the fact that processes are using a
full-information protocol). Thus, the nature and power of the failure detectors
that actually satisfy this axiom, and the universe of failure detectors under
consideration, are not entirely clear.

9 Relation between and 0 

In environment ^ 0 is at least as weak as that is, it is possible to trans- 
form to 0 in EG This transformation is given in Fig. 4. Initially, each pro- 
cess p sets its failure detector output to 77 (trust all processes). There are three 
concurrent tasks. In the first task, p repeatedly sends “I-am-alive” to all
processes in the system. In the second task, when p receives one such message from
process q, it adds q to the set got. In the third task, process p loops forever. In
each iteration, p checks whether the failure detector being transformed has at
some time output a pair $(S, k)$ such that $k > |S| - n + \min(f, n - 1)$ and got
contains the complement of $S$. In that case, p sets its failure detector output to
got, and then resets got to the empty set.
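
The test performed in each iteration of the third task can be written out as follows. This Python sketch is only an illustration of the stated condition: the function name, the representation of the observed pairs as a list `suspect_events`, and the convention of returning `None` when no pair qualifies are assumptions made for the example.

```python
def task3_update(suspect_events, got, all_processes, f):
    """One iteration of Task 3.

    suspect_events: pairs (S, k) output so far, with S a set of processes.
    got: processes from which an "I-am-alive" message has been received.
    Returns (new_output, new_got); new_output is None if no pair qualifies.
    """
    n = len(all_processes)
    for S, k in suspect_events:
        threshold = len(S) - n + min(f, n - 1)
        # got must contain the complement of S
        if k > threshold and (all_processes - S) <= got:
            return set(got), set()   # trust the processes in got, then reset got
    return None, got
```

For example, with n = 4 and f = 3, the pair (S, k) = ({c, d}, 2) qualifies because 2 > 2 - 4 + 3 = 1; if got contains a and b, the output becomes the current contents of got.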

Theorem 3. The algorithm in Fig. 4 transforms G^ into 0 in environment . 

The proof is simple, and can be found in [ATD99] . 



10 Related Work 

The difference between the concepts of Agreement and Uniform Agreement was 
first pointed out in [Had86] in a comparison of Consensus versus Atomic Com- 
mitment. The term “Uniform” was introduced in [GT89, NT90], where it was 
studied in the context of Reliable Broadcast. In these papers, it is shown that 
with send and receive omission failures, URB can be solved if and only if a 
majority of processes are correct. 

[BCBT96] considers systems with process crashes and fair (lossy) links, and
addresses the following question: given any problem P that can be solved in a
system where the only possible failures are process crashes, is P still solvable if
links can also fail by losing messages? [BCBT96] shows that if P can be solved
in systems with only process crashes, then P can also be solved in systems with
process crashes and fair links, provided that (a) P is correct-restricted,‡ or (b)
a majority of processes are correct (i.e., $n > 2f$). As a corollary of this result
(and the fact that URB is solvable in systems with only process crashes), we get
that URB is solvable in systems with $f < n/2$ process crashes and fair links.

* Even though the definition of a failure detector states that it must output a set S of
processes, and that S should be “interpreted” as processes suspected of being faulty,
there is nothing in the definition to enforce this interpretation: the model does not
tie the output S to the crashes that occur in a run. Thus, the formal definition allows
a failure detector to use its output S to encode information that has nothing to do
with failures.

† This is axiom A4 in [HR99].

    For every process p:

        Initialization:
            D'_p ← Π        { trust all processes }
            got ← ∅

        cobegin
            || Task 1:
                while true do send (I-am-alive) to all processes

            || Task 2:
                upon receive (I-am-alive) from q do
                    got ← got ∪ {q}

            || Task 3:
                while true do
                    if there exists S, k such that
                        (1) p got event suspect(S, k) (from the failure detector being transformed),
                        (2) k > |S| − n + min(f, n − 1), and
                        (3) got contains Π \ S
                    then D'_p ← got; got ← ∅        { trust processes in got }
        coend

Fig. 4. Transformation to $\mathcal{D}' = \Theta$ in $\mathcal{E}_f$.

[HR99] is the first paper to consider solving URB in systems with fair links 
and $f > n/2$. More precisely, [HR99] shows that if URB can be solved in a
system S that satisfies some axioms A1-A5, then that system can “simulate” 
a system with failure detector . This result holds even if system S has no 
failure detectors, but a different kind of oracle (axioms A1-A5 place restrictions 
on the allowable oracles). A discussion of other differences between [HR99] and 
this paper was given in Sect. 1. 



‡ Intuitively, a problem P is correct-restricted if its specification does not refer to the
behavior of faulty processes [BN92, Gop92]. Note that URB is not correct-restricted.

11 Concluding Remarks

In some environments, URB can be solved without failure detectors at all, and
this seems to contradict the fact that $\Theta$ is the weakest failure detector for URB
in any environment. There is no contradiction, however, because in such
environments $\Theta$ can be implemented.

For example, as we saw in the previous section, URB can be solved without
failure detectors in an environment $\mathcal{E}_{maj}$ where a majority of processes are
correct. This does not contradict Theorem 2 because $\Theta$ can be implemented in
$\mathcal{E}_{maj}$, as we now explain.

To implement $\Theta$ in $\mathcal{E}_{maj}$, processes periodically send an “I-am-alive” message
to all processes, and each process p keeps a list of processes $Order_p$. This
list records the order in which the last “I-am-alive” message from each process
is received. More precisely, $Order_p$ is initially an arbitrary permutation of the
processes, and when p receives an “I-am-alive” message from q, p moves q to the
front of $Order_p$. To obtain $\Theta$, a process p repeatedly outputs the first
$\lceil (n + 1)/2 \rceil$ processes in $Order_p$ as the set of trusted processes. It is easy
to see why this implementation works: any process that crashes stops sending
“I-am-alive” messages, and so it eventually moves towards the end of $Order_p$ and
remains there forever afterwards. Since at most $\lfloor (n - 1)/2 \rfloor$ processes crash,
all processes that crash are eventually and permanently among the last
$\lfloor (n - 1)/2 \rfloor$ processes in $Order_p$, so they do not appear among the first
$\lceil (n + 1)/2 \rceil$ processes. Thus our implementation satisfies $\Theta$-completeness.
To see that it also satisfies $\Theta$-accuracy, note that among the first
$\lceil (n + 1)/2 \rceil$ processes in $Order_p$, there is always at least one correct
process (since no majority of processes can crash in $\mathcal{E}_{maj}$).
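
The list manipulation just described is short enough to write out. The Python sketch below covers only the local bookkeeping at one process; the class name `MajorityTheta` and its method names are assumptions made for this illustration, and the periodic sending of “I-am-alive” messages is left to the caller.

```python
import math


class MajorityTheta:
    """Local state of one process in the majority-based implementation of Theta."""

    def __init__(self, processes):
        # Order_p: initially an arbitrary permutation of the processes.
        self.order = list(processes)

    def on_alive(self, q):
        """Called when an "I-am-alive" message from q is received: move q to the front."""
        self.order.remove(q)
        self.order.insert(0, q)

    def trusted(self):
        """The first ceil((n + 1) / 2) processes of Order_p."""
        n = len(self.order)
        return set(self.order[: math.ceil((n + 1) / 2)])
```

With five processes of which d and e have crashed, once messages from a, b, and c have been received the crashed processes occupy the last two positions, so `trusted()` returns {a, b, c}.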

In general, from the transformation algorithm in Fig. 3, the following
obviously holds:

Remark 4. Consider an asynchronous distributed system with process crashes
and fair links, and with environment $\mathcal{E}$. If URB can be solved in $\mathcal{E}$ without
any failure detectors then $\Theta$ can be implemented in $\mathcal{E}$.



Abstract. Unreliable failure detectors, proposed by Chandra and
Toueg [2], are mechanisms that provide information about process fail- 
ures. In [2], eight classes of failure detectors were defined, depending on 
how accurate this information is, and an algorithm implementing a fail- 
ure detector of one of these classes in a partially synchronous system was 
presented. This algorithm is based on all-to-all communication, and
periodically exchanges a number of messages that is quadratic in the number
of processes. To our knowledge, no other algorithm implementing these
classes of unreliable failure detectors has been proposed. 

In this paper, we present a family of distributed algorithms that imple- 
ment four classes of unreliable failure detectors in partially synchronous 
systems. Our algorithms are based on a logical ring arrangement of the 
processes, which defines the monitoring and failure information propa- 
gation pattern. The resulting algorithms periodically exchange at most 
a linear number of messages. 



1 Introduction 

The concept of unreliable failure detector was introduced by Chandra and Toueg
in [2]. These authors showed how unreliable failure detectors can be used to
solve the Consensus problem [10] in asynchronous systems. (This was shown to
be impossible in a pure asynchronous system by Fischer et al. [7].) Since then, a
considerable amount of work has been devoted to studying the properties of the
failure detection abstraction [1,6,9].

From the results of Fischer et al. and those of Chandra and Toueg, it follows
that, in asynchronous systems, it is impossible to implement failure detectors
precise enough to solve Consensus. Chandra and Toueg presented an algorithm
that implements an unreliable failure detector in a partially synchronous
system. To our knowledge, this is the only proposed algorithm implementing any
of the classes of unreliable failure detectors defined in [2]. In this paper we
present more efficient alternatives to that first algorithm.



1.1 Partial Synchrony 

Distributed algorithms can be designed under different assumptions of system 
behaviors, i.e., system models. One of the main assumptions in which system 
models can differ is related to the timing aspects. Most models focus on two 
timing attributes: the time taken for message delivery across a communication 
channel, and the time taken by a processor to execute a piece of code. Depending 
on whether these attributes are bounded or not, and on the knowledge of these 
bounds, they can be classified as synchronous, asynchronous, or partially syn- 
chronous [5]. A timing attribute is synchronous if there is a known fixed upper 
bound on it. On the other hand, it is asynchronous if there is no bound on it. 
Finally, a timing attribute is partially synchronous if it is neither synchronous 
nor asynchronous. Dwork et al. [5] consider two kinds of partial synchrony. In 
the first one, the timing attribute is bounded, but the bound is unknown. In the 
second one, the timing attribute is bounded and the bound is known, but it holds 
only after an unknown stabilization interval. Chandra and Toueg [2] propose an- 
other kind of partial synchrony, in which the timing attribute is bounded, but 
the bound is unknown and holds only after an unknown stabilization interval. 
This will be the model of partial synchrony used in this paper. 

Although the asynchronous model (in which at least one of the timing at- 
tributes is asynchronous) is attractive for designing distributed algorithms, it 
is well known that a number of synchronization distributed problems cannot be 
solved deterministically in asynchronous systems in which processes can fail. For 
instance, as we mentioned above, Consensus cannot be solved deterministically in
an asynchronous system that is subject to even a single process failure [7], while 
it can be solved in both synchronous and partially synchronous systems [2,4,5]. 
In fact, the ability to solve these synchronization distributed problems closely 
depends on the ability to detect failures. In a synchronous system, reliable failure 
detection is possible. One can reliably detect failures using timeouts. (The time- 
outs can be derived from the known upper bounds on message delivery time and 
processing time.) In an asynchronous system, it is impossible to distinguish a 
failed process from a very slow one. Thus, reliable failure detection is impossible. 

However, even if it is sufficient, reliable failure detection is not necessary to 
solve most of these problems. As we already mentioned, Chandra and Toueg [2] 
introduced unreliable failure detectors (failure detectors that can make
mistakes), and showed how they can be used to solve Consensus and Atomic
Broadcast. Guerraoui et al. [8] showed how unreliable failure detectors can be used to
solve the Non-Blocking Atomic Commitment problem. 

1.2 Unreliable Failure Detectors 

An unreliable failure detector is a mechanism that provides (possibly incorrect)
information about process failures. When it is queried, the failure detector
returns a list of processes believed to have crashed (suspected processes). In [2],
failure detectors were characterized in terms of two properties: completeness and
accuracy. Completeness characterizes the failure detector's capability of suspecting
every incorrect process (processes that actually crash), while accuracy
characterizes the failure detector's capability of not suspecting correct processes.
Two kinds of completeness and four kinds of accuracy were defined, which combined
yield eight classes of failure detectors.

                      Eventual strong accuracy                        Eventual weak accuracy

Strong completeness   Eventually Perfect $\Diamond\mathcal{P}$              Eventually Strong $\Diamond\mathcal{S}$

Weak completeness     Eventually Quasi-Perfect $\Diamond\mathcal{Q}$        Eventually Weak $\Diamond\mathcal{W}$

Fig. 1. Four classes of failure detectors defined in terms of completeness and
accuracy.

In this paper we will focus on the two kinds of completeness and two of the 
four kinds of accuracy defined in [2], which are the following:

— Strong completeness. Eventually, every process that crashes is permanently 
suspected by every correct process. 

— Weak completeness. Eventually, every process that crashes is permanently 
suspected by some correct process. 

Note that completeness by itself is not very useful. We can trivially satisfy strong 
completeness by forcing every process to permanently suspect every other process 
in the system. 

— Eventual strong accuracy. Eventually, no correct process is ever suspected 
by any correct process. 

— Eventual weak accuracy. Eventually, some correct process is never suspected 
by any correct process. 

Combining these completeness and accuracy properties in pairs, we obtain four
different failure detector classes, which are shown in Fig. 1. Out of these, Chandra
et al. [3] showed that $\Diamond\mathcal{W}$ is the weakest class of failure detectors required for
solving Consensus.

Chandra and Toueg [2] proposed a timeout-based implementation of a
$\Diamond\mathcal{P}$ failure detector in a system with partial synchrony (they recognize that,
in practice, some synchrony is required to implement the failure detectors they
propose). In their algorithm, every process periodically sends a message to every
other process in order to inform it that the sender has not crashed. If there are n
processes in the system and C of them do not crash, at least nC messages are
periodically exchanged with this algorithm. We do not know of any other
implementation of these classes of failure detectors.

1.3 Our Results

In this paper, we present a family of algorithms that implement unreliable fail- 
ure detectors of the four classes defined in the previous section, in partially 
synchronous systems. Our algorithms have been designed, and are presented, in 
a gradual way. First, we present an algorithm that provides weak completeness. 
Next, we show how to extend this algorithm to provide eventual weak accuracy. 
This extended algorithm implements a $\Diamond\mathcal{W}$ failure detector. Next, we present
two other extensions which strengthen the accuracy and the completeness, re- 
spectively, implementing the stronger failure detectors. 

In all these algorithms, each correct process monitors only one other pro- 
cess in a cyclic fashion. The monitoring process performs this task by repeat- 
edly polling the monitored process. Each polling involves only two messages 
exchanged between the monitoring and monitored processes. If the pollings were 
done periodically, a total of no more than 2n messages would be periodically 
exchanged. Eventually, this amount becomes at most 2C, which is a significant 
improvement over the at least nC messages of the previous algorithm (Chandra 
and Toueg’s). 
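
To make the comparison concrete, the bounds quoted above can be evaluated for sample values. The Python snippet below only restates the figures given in the text (at least nC messages for the all-to-all algorithm, at most 2n and eventually at most 2C for the ring-based polling); the function names and the sample values n = 10, C = 8 are assumptions chosen for illustration.

```python
def all_to_all_lower_bound(n, c):
    """Messages exchanged per period by the all-to-all algorithm: at least n * C."""
    return n * c


def ring_upper_bound(n, c, stabilized):
    """Messages exchanged per period by ring polling: at most 2n, and eventually at most 2C."""
    return 2 * c if stabilized else 2 * n


n, c = 10, 8
print(all_to_all_lower_bound(n, c))              # 80
print(ring_upper_bound(n, c, stabilized=False))  # 20
print(ring_upper_bound(n, c, stabilized=True))   # 16
```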

The rest of the paper is organized as follows. The next section describes our 
model of distributed system. In Section 3, we present a basic algorithm that 
provides weak completeness. In Section 4, we present an extension to the basic 
algorithm that provides eventual weak accuracy. In Section 5, we present another 
extension that provides eventual strong accuracy. In Section 6, we present an 
extension to the previous algorithms that provides strong completeness, while 
preserving accuracy. In Section 7, we study the performance of our algorithms 
in terms of the number and the size of the messages periodically exchanged. 
Finally, Section 8 summarizes the conclusions and presents future lines of work. 

2 The Model 

2.1 System Model 

Our model of distributed system consists of a set $\Pi$ of n processes,
$\Pi = \{p_1, \ldots, p_n\}$, that communicate by exchanging messages. Every pair of
processes is assumed to be connected by a reliable communication channel.

Processes can fail by crashing, that is, by prematurely halting. Crashed
processes do not recover. In every execution of the system we identify two
complementary subsets of $\Pi$: the subset of processes that do not fail, denoted
correct, and the subset of processes that do fail, denoted crashed. We use C to
denote the number of correct processes in the system, which we assume is at least
one, i.e., C = |correct| > 0. For every process p in crashed we use $T_{crash_p}$ to
denote the instant at which p crashes.

In the algorithms presented in this paper we consider the processes $p_1, \ldots, p_n$
arranged in a logical ring. This arrangement is known by all the processes.
Without loss of generality, process $p_i$ is followed in the ring by process
$p_{(i \bmod n)+1}$. In general, we use succ(p) to denote the process that follows
process p in the ring, and pred(p) to denote the process that precedes process p
in the ring. Finally, we use corr_succ(p) and corr_pred(p) to denote the closest
correct (i.e., belonging to the subset correct) successor and predecessor of p in
the ring, respectively.
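
The ring functions can be expressed directly on process indices 1..n. The following Python sketch is an illustration under the assumption that processes are identified by their index and that the set of correct processes is passed in explicitly (in the model, of course, no process knows this set); the function names follow the notation above.

```python
def succ(i, n):
    """Index of the process that follows p_i in the ring: (i mod n) + 1."""
    return (i % n) + 1


def pred(i, n):
    """Index of the process that precedes p_i in the ring."""
    return ((i - 2) % n) + 1


def corr_succ(i, n, correct):
    """Closest correct successor of p_i; assumes at least one correct process."""
    j = succ(i, n)
    while j not in correct:
        j = succ(j, n)
    return j


def corr_pred(i, n, correct):
    """Closest correct predecessor of p_i; assumes at least one correct process."""
    j = pred(i, n)
    while j not in correct:
        j = pred(j, n)
    return j
```

For n = 5 and correct = {1, 2, 4}, the closest correct successor of process 2 is process 4, and the closest correct predecessor of process 1 is also process 4.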

We consider the model of partial synchrony proposed by Chandra and 
Toueg [2] . In this model, there are bounds on both message delivery time and pro- 
cessing time, but these bounds are not known and only hold after an unknown, 
but finite, stabilization interval. We shall use Tg to denote the ending instant of 
this stabilization interval in the execution of interest. We also denote by
$\Delta_{msg}$ the maximum time interval, after stabilization, between the instant a
process sends a message and the instant that message is delivered and processed by
its destination process (assuming that neither the sender nor the destination has
failed). Clearly, $\Delta_{msg}$ depends on the existing bounds on both message delivery
time and processing time. Note that the exact value of $\Delta_{msg}$ exists, but it is
unknown.

2.2 Implementation of Failure Detectors 

A distributed failure detector can be viewed as a set of n failure detection
modules, each one attached to a different process in the system. These modules
cooperate to satisfy the required properties of the failure detector. Each module
maintains a list of the processes it suspects to have crashed. These lists can
differ from one module to another at a given time. We denote by $L_p$ ($G_p$ in
Section 6) the list of suspected processes of the failure detection module
attached to process p. Clearly, the contents of the list $L_p$ ($G_p$) can be
different at different times. We use $L_p(t)$ ($G_p(t)$ in Section 6) to denote
the contents of $L_p$ ($G_p$) at time t. A process p interacts only with its
local failure detection module in order to get the current list of suspected
processes.

In this paper, we only describe the behavior of the failure detection modules
in order to implement a failure detector, but not the behavior of the processes
they are attached to. For this reason, in the rest of the paper we will mostly use
the term process instead of failure detection module. We consider that a process
cannot crash independently of its attached failure detection module.

In any algorithm that implements any of the failure detector classes defined
in Section 1.2, it is required that some processes monitor other processes.
Monitoring allows a process to detect whether another process has crashed and to
take proper action if so (usually suspect it). Clearly, there are several possible
ways to implement the monitoring. Examples are the monitored process sending
an I-AM-ALIVE message (a heartbeat) to the monitoring process, or the latter
polling the former for such a message. In any case, the only way a process can
show it has not crashed is by sending messages to those monitoring it. Hence,
any monitoring protocol requires that the monitored process sends messages to
the monitoring process.

Our algorithms use polling instead of only sending heartbeats, because
polling allows a finer control of the monitoring. To monitor process q, a
process p sends an ARE-YOU-ALIVE? message to q and waits for an I-AM-ALIVE
message from it. As soon as q receives the ARE-YOU-ALIVE? message, it sends the
I-AM-ALIVE message to p. We will denote by $\Delta_{rtt} = 2\Delta_{msg}$ the
maximum monitoring round-trip time after stabilization, i.e., the maximum time,
after $T_s$, elapsed between the sending of an ARE-YOU-ALIVE? message to a
correct process, and the reception and processing of the corresponding
I-AM-ALIVE reply message.

Since a monitoring process p does not know $\Delta_{rtt}$, it has to use an
estimated value (timeout) that tells how much time it has to wait for the reply
from the monitored process q. This time value is denoted by $\Delta_{p,q}$.
Then, if after $\Delta_{p,q}$ time p did not receive the reply from q, it
suspects that q has crashed. We need to allow these time values to vary over
time in our algorithms. We use $\Delta_{p,q}(t)$ to denote the value of
$\Delta_{p,q}$ at time t.

3 A Basic Algorithm that Provides Weak Completeness 

In this section, we present an algorithm that will be used as a framework for all 
the failure detector implementations presented in this paper. This first algorithm 
satisfies the weak completeness property. In the following sections we will extend 
the algorithm to satisfy also eventual weak accuracy, eventual strong accuracy, 
and strong completeness. This algorithm is presented here for the sake of clarity 
but is not very useful by itself, since it does not satisfy any of the accuracy 
properties previously defined. 

The algorithm executes as follows: initially, every process starts monitoring 
its successor in the ring. If a process p does not receive the reply from the pro- 
cess q it is monitoring, then p suspects that q has crashed, and starts monitoring 
the successor of q in the ring. This monitoring scheme is repeated, so that p 
always suspects all processes in the ring between itself and the process it is 
monitoring (not included). If, later on, p receives a message from a suspected 
process q while it is monitoring another process r, then p stops suspecting q and 
all the processes between q and r in the ring, and starts monitoring q again. 

Fig. 2 presents the algorithm in detail. Each process p has a variable
$target_p$ which holds the process being monitored by p at a given time. As we
said above, all processes between p and $target_p$ in the ring (and only them)
are suspected by p, and these are the only processes included in the list $L_p$
of suspected processes of p. (Initially, no process is suspected, i.e.,
$\forall p : L_p(0) = \emptyset$.) The $mutex_p$ variable is used to avoid race
conditions in process p.
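
The bookkeeping of $target_p$ and $L_p$ described above can be sketched in Python. This is only an illustration under stated assumptions, not the authors' implementation: the class name Monitor and its method names are hypothetical, processes are assumed to be numbered 1..n, and timers, message transport and the mutex are left out, so the caller decides when a timeout or a message arrival occurs.

```python
class Monitor:
    """Local state of the failure detection module of process p (ring 1..n)."""

    def __init__(self, p: int, n: int) -> None:
        self.p = p
        self.n = n
        self.target = self._succ(p)   # target_p: process currently monitored
        self.suspected = set()        # L_p: initially nobody is suspected

    def _succ(self, i: int) -> int:
        return (i % self.n) + 1

    def _between(self, a: int, b: int) -> list:
        """Processes a, succ(a), ... up to but not including b."""
        out = []
        while a != b:
            out.append(a)
            a = self._succ(a)
        return out

    def on_timeout(self) -> None:
        """Call only when no reply from target arrived within the timeout:
        suspect the target and start monitoring its successor."""
        self.suspected.add(self.target)
        self.target = self._succ(self.target)

    def on_message(self, q: int) -> None:
        """A message from q arrived. If q was suspected, stop suspecting
        q, ..., pred(target) and monitor q again."""
        if q in self.suspected:
            self.suspected -= set(self._between(q, self.target))
            self.target = q
```

In this sketch the suspected set always equals the processes strictly between p and the current target, which is the invariant stated in the text. For example, after two timeouts process 1 in a ring of five suspects {2, 3} and monitors 4; a message from 3 then leaves only 2 suspected and makes 3 the target again.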

We now show that weak completeness holds with this algorithm. Given an
incorrect process p, the following theorem states that it will be permanently
suspected by corr_pred(p) (the first correct process preceding p in the ring).

Theorem 1. $\exists t_0 : \forall p \in \mathit{crashed}$, p has failed at time
$t_0$ and $\forall t > t_0, p \in L_{\mathit{corr\_pred}(p)}(t)$.

Proof. Let p be a process that crashes. We claim that p will be permanently
included in $L_{\mathit{corr\_pred}(p)}$. The proof uses strong induction on the
distance from corr_pred(p) to p. Let us first consider the case in which this
distance is 1, i.e., corr_pred(p) = pred(p). Before p fails, corr_pred(p) and p
exchange ARE-YOU-ALIVE? and I-AM-ALIVE messages (see Fig. 2). Eventually p
crashes, and there is an ARE-YOU-ALIVE? message sent by corr_pred(p) that
reaches p after $T_{crash_p}$. Since p has already crashed by then, it will
never reply to that message. If such a message was sent at time t', then
$\Delta_{\mathit{corr\_pred}(p),p}(t')$ time later, corr_pred(p) will include p
in $L_{\mathit{corr\_pred}(p)}$. Since no message will ever be received by
corr_pred(p) from p after that, p will never be removed from
$L_{\mathit{corr\_pred}(p)}$.

We will now prove that if the claim holds for any distance d with
$1 \le d \le i-1$, it also holds for distance i. Let us assume that the distance
from corr_pred(p) to p is $i > 1$. Then, for any process
$q \in \{succ(\mathit{corr\_pred}(p)), \ldots, pred(p)\}$, it can easily be seen
that corr_pred(q) = corr_pred(p) and that the distance d from corr_pred(p) to q
satisfies $1 \le d \le i-1$. Hence, from the induction hypothesis, all processes
in $\{succ(\mathit{corr\_pred}(p)), \ldots, pred(p)\}$ will eventually be
permanently in $L_{\mathit{corr\_pred}(p)}$. After that, they will never be
monitored again by corr_pred(p). The situation then is similar to the distance-1
case considered above and, by a similar argument, p will eventually be
permanently included in $L_{\mathit{corr\_pred}(p)}$.

Every process p executes:

    target_p ← succ(p)
    ∀q ∈ Π : Δ_{p,q} ← default timeout

    cobegin
      || Task 1:
        loop
          wait(mutex_p)
          send ARE-YOU-ALIVE? to target_p
          t_out ← Δ_{p,target_p}
          received ← false
          signal(mutex_p)
          delay t_out
          wait(mutex_p)
          if not received
            L_p ← L_p ∪ {target_p}
            target_p ← succ(target_p)
          end if
          signal(mutex_p)
        end loop

      || Task 2:
        loop
          receive message m from a process q
          wait(mutex_p)
          case
            m = ARE-YOU-ALIVE?:
              send I-AM-ALIVE to q
              if q ∈ L_p
                L_p ← L_p − {q, ..., pred(target_p)}
                target_p ← q
                received ← true
              end if
            m = I-AM-ALIVE:
              case
                q = target_p:
                  received ← true
                q ∈ L_p:
                  L_p ← L_p − {q, ..., pred(target_p)}
                  target_p ← q
                  received ← true
                else discard m
              end case
          end case
          signal(mutex_p)
        end loop
    coend

Fig. 2. Algorithm that provides weak completeness.



Corollary 1. The algorithm of Fig. 2 provides weak completeness.

4 Extending the Basic Algorithm to Provide Eventual
Weak Accuracy

The algorithm presented in the previous section does not satisfy any of the 
accuracy properties defined in Section 1.2. It does not prevent the erroneous 
suspicion of any correct process, and these incorrect suspicions, although not 
permanent (if the suspected process is correct, the reply message will eventually 
be received), can happen infinitely often. This is due to the fact that the message 
delivery time could be greater than the fixed default timeout (see Fig. 2). In 
order to provide some useful accuracy, the timeout values must be augmented 
when processes are aware of having erroneously suspected a correct process. In 
this section, we present an extension to the basic algorithm of Fig. 2, based 
on augmenting the timeout values, which satisfies the eventual weak accuracy 
property. 

Eventual weak accuracy requires that, eventually, some correct process is 
never suspected by any correct process. In order to provide it, it is enough that 
this is satisfied for only one correct process. Our extension to the basic algorithm 
guarantees the existence of such a process, which we denote leader. Clearly, if we 
knew beforehand a correct process, eventual weak accuracy could be obtained 
by making all processes augment their timeout value with respect to this process 
each time they suspect it. This correct process would be leader. But since we 
cannot know in advance the correctness of any process, we need to devise another 
way to eventually have a correct and not-suspected process. 

In our extension of the algorithm of Fig. 2, processes behave as follows.
Initially, every process considers a pre-agreed process (e.g., $p_1$) as the
initial candidate to be leader. When a process that monitors this candidate
suspects it, it considers the candidate's successor in the ring as the new
candidate and monitors it. This scheme is repeated every time the current
candidate is suspected. (Note that a process not monitoring a candidate cannot
suspect it.) If a process p stops suspecting a process q, previously considered
as candidate, then p will augment its timeout value $\Delta_{p,q}$.¹ If the
previously suspected process q was not considered as candidate, then p will not
change $\Delta_{p,q}$. This way, leader will be the first correct process in the
ring starting from the initial candidate (inclusive). All processes monitoring
it will eventually stop suspecting it, and processes that do not monitor it will
never suspect it. This gives us the eventual weak accuracy property. Fig. 3
presents the extended algorithm in detail.
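
The two rules in this paragraph (which process ends up as leader, and when a timeout is augmented) can be sketched in Python. The function names and the integer numbering 1..n are assumptions for illustration only. The guard mirrors the membership test of the initial candidate in {succ(p), ..., target} that appears in Fig. 3.

```python
from typing import Dict, Set


def leader(initial_cand: int, n: int, correct: Set[int]) -> int:
    """First correct process in ring order, starting from the initial
    candidate (inclusive)."""
    if not correct:
        raise ValueError("the model assumes at least one correct process")
    q = initial_cand
    while q not in correct:
        q = (q % n) + 1
    return q


def candidate_in_range(p: int, target: int, initial_cand: int, n: int) -> bool:
    """True if initial_cand lies in {succ(p), ..., target}."""
    q = (p % n) + 1
    while True:
        if q == initial_cand:
            return True
        if q == target:
            return False
        q = (q % n) + 1


def bump_timeout_on_suspicion(p: int, target: int, initial_cand: int, n: int,
                              timeouts: Dict[int, int]) -> None:
    """Augment p's timeout for target by one, but only when target is
    currently being treated as a candidate by p."""
    if candidate_in_range(p, target, initial_cand, n):
        timeouts[target] += 1
```

For instance, with n = 5, initial candidate 1 and correct = {3, 4}, the leader is process 3. Process 5 monitoring process 1 augments its timeout when it suspects process 1, whereas process 1 monitoring process 2 does not.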

We now show that eventual weak accuracy holds with this algorithm, i.e., 
eventually some correct process is never suspected by any correct process. 

Lemma 1. After $T_s$, any correct process p will suspect leader for no more
than $\Delta_{rtt}$ time, each time it does.

Proof. Remember that, after $T_s$, $\Delta_{rtt}$ is a bound on the monitoring
round-trip time. A correct process p suspects leader after sending an
ARE-YOU-ALIVE? message to it at time t and not receiving an I-AM-ALIVE reply
message in $\Delta_{p,leader}(t)$ time. Since, by definition, leader is a
correct process, the I-AM-ALIVE message will arrive at most at time
$t + \Delta_{rtt}$ ($T_s + \Delta_{rtt}$ if $t < T_s$). At this moment leader
is removed from $L_p$, the list of suspected processes of p.

Every process p executes:

    initial_cand_p ← pre-agreed process
    target_p ← succ(p)
    L_p ← ∅
    ∀q ∈ Π : Δ_{p,q} ← default timeout

    cobegin
      || Task 1:
        loop
          wait(mutex_p)
          send ARE-YOU-ALIVE? to target_p
          t_out ← Δ_{p,target_p}
          received ← false
          signal(mutex_p)
          delay t_out
          wait(mutex_p)
          if not received
            if initial_cand_p ∈ {succ(p), ..., target_p}
              Δ_{p,target_p} ← Δ_{p,target_p} + 1
            end if
            L_p ← L_p ∪ {target_p}
            target_p ← succ(target_p)
          end if
          signal(mutex_p)
        end loop

      || Task 2:
        {Same as algorithm in Fig. 2}
    coend

Fig. 3. Extension to the algorithm of Fig. 2 to provide eventual weak accuracy.

¹ For simplicity of the algorithm, instead of increasing timeouts when we stop
suspecting, we increase timeouts as soon as we suspect a candidate.

Lemma 2. Any correct process p will suspect leader a finite number of times. 

Proof. Let p be some correct process. Since the value of $T_s$ is finite, p
suspects leader a finite number of times before $T_s$. After that, from the
algorithm, each time p suspects leader, the value of $\Delta_{p,leader}$ is
incremented by one. After suspecting leader a finite number of times,
$\Delta_{p,leader}$ will be greater than $\Delta_{rtt}$. After this moment, p
never suspects leader anymore.
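
The counting argument can be made concrete with a small worked bound. This is a hedged sketch rather than a statement from the paper: $\delta$ is a symbol introduced here for the value of $\Delta_{p,leader}$ at time $T_s$, and k counts only the suspicions whose ARE-YOU-ALIVE? message was sent at or after $T_s$. Such a suspicion can only occur while the timeout in force does not exceed $\Delta_{rtt}$, and each one adds 1 to the timeout, so the k-th one happens with timeout $\delta + k - 1$:

```latex
\delta + (k - 1) \le \Delta_{rtt}
\quad\Longrightarrow\quad
k \le \max\bigl(0,\ \lfloor \Delta_{rtt} - \delta \rfloor + 1\bigr)
```

For example, with $\Delta_{rtt} = 5$ and $\delta = 3$ (in the same time units as the unit increment), at most three such suspicions can occur, with timeouts 3, 4 and 5, after which the timeout is 6 and exceeds $\Delta_{rtt}$.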

Theorem 2. $\exists t_1 : \forall p \in \mathit{correct}, \forall t > t_1,
\mathit{leader} \notin L_p(t)$.

Proof. Let $t_1^p$ be the instant at which a correct process p stops suspecting
leader for the last time. (If p never suspects leader, $t_1^p = 0$.) Such an
instant exists from Lemma 1 and Lemma 2. Then, after instant
$t_1 = \max_{p \in \mathit{correct}} \{t_1^p\}$ no correct process p has leader
in its list $L_p$.

Corollary 2. The algorithm of Fig. 3 provides eventual weak accuracy. 

Observation 1 The only difference between this algorithm and the algorithm of
Fig. 2 is that in the former the values of $\Delta_{p,q}$ can change. Clearly,
this does not affect the proof of Theorem 1. Hence, Corollary 1 also applies to
this algorithm.

Corollary 3. The algorithm of Fig. 3 implements a failure detector of class
$\Diamond W$.

Proof. Follows from Corollary 2, Observation 1, and Corollary 1.

Every process p executes:

    target_p ← succ(p)
    ∀q ∈ Π : Δ_{p,q} ← default timeout

    cobegin
      || Task 1:
        loop
          wait(mutex_p)
          send ARE-YOU-ALIVE? to target_p
          t_out ← Δ_{p,target_p}
          received ← false
          signal(mutex_p)
          delay t_out
          wait(mutex_p)
          if not received
            Δ_{p,target_p} ← Δ_{p,target_p} + 1
            L_p ← L_p ∪ {target_p}
            target_p ← succ(target_p)
          end if
          signal(mutex_p)
        end loop

      || Task 2:
        {Same as algorithm in Fig. 2}
    coend

Fig. 4. Extension to the algorithm of Fig. 2 to provide eventual strong accuracy.

5 Extending the Basic Algorithm to Provide Eventual 
Strong Accuracy 

Eventual strong accuracy requires that, eventually, no correct process is ever 
suspected by any correct process. In this section, we propose another extension 
to the basic algorithm of Section 3 which satisfies this property. Broadly, the 
extension consists in each process augmenting its timeout values with respect 
to all processes it incorrectly suspects. This way every process will augment the 
timeout value with respect to its closest correct successor in the ring, and will 
thus eventually stop suspecting it (and hence, any other correct process). This
gives us the eventual strong accuracy property. Fig. 4 presents the extended
algorithm in detail.

We now show that eventual strong accuracy holds with the algorithm in 
Fig. 4. We start with two lemmas, whose proofs are similar to those of Lemma 1 
and Lemma 2, respectively, and are omitted. 

Lemma 3. After $T_s$, any correct process p will suspect corr_succ(p) for no
more than $\Delta_{rtt}$ time, each time it does.

Lemma 4. Any correct process p will suspect corr_succ(p) a finite number of
times.

Theorem 3. $\exists t_2 : \forall p \in \mathit{correct}, \forall q \in
\mathit{correct}, \forall t > t_2, q \notin L_p(t)$.

Proof. Let $t_2^p$ be the instant at which a correct process p stops suspecting
corr_succ(p) for the last time. (If p never suspects corr_succ(p), $t_2^p = 0$.)
Such an instant exists from Lemma 3 and Lemma 4. Then, after instant
$t_2 = \max_{p \in \mathit{correct}} \{t_2^p\}$ no correct process p has
corr_succ(p) in its list $L_p$. Then, after $t_2$, each correct process p only
suspects processes in $\{succ(p), \ldots, pred(\mathit{corr\_succ}(p))\}$, which
are not correct by the definition of corr_succ(p). Therefore, no correct
process q is in $L_p$ after $t_2$.



Corollary 4. The algorithm of Fig. 4 provides eventual strong accuracy.

Note that Observation 1 still applies to this algorithm. Hence the following
corollary, which follows from Corollary 4, Observation 1, and Corollary 1.

Corollary 5. The algorithm of Fig. 4 implements a failure detector of class
$\Diamond Q$.

6 Extending the Previous Algorithms to Provide Strong 
Completeness 

In this section we present an extension to the previous algorithms to provide
strong completeness, while preserving accuracy. By combining this extension
with the algorithms that implement failure detectors of classes $\Diamond W$
and $\Diamond Q$, presented in previous sections, we obtain implementations of
failure detectors of classes $\Diamond S$ and $\Diamond P$, respectively.

Strong completeness requires that, eventually, every process that crashes is
permanently suspected by every correct process. In [2], Chandra and Toueg
presented a distributed algorithm that transforms weak completeness into strong
completeness. Broadly, in their algorithm, every process periodically broadcasts
(sends to every other process) its local list of suspected processes. Upon
reception of these lists, each process builds a global list of suspected
processes, which provides strong completeness. Clearly, in this algorithm each
correct process periodically sends n messages, with the total number of messages
exchanged being at least nC.

In our extension, we follow a similar approach. Besides its local list $L_p$ of
suspected processes, each process p has a global list $G_p$ of suspected
processes. While $L_p$ only holds the suspected processes between p and the
process p is monitoring ($target_p$), $G_p$ holds all the processes that are
being suspected in the system. Now, the global lists are the ones providing
strong completeness.

In order to correctly build the global lists, processes need to propagate their
local lists. However, instead of periodically broadcasting its local list, every
process will only send its global list (which contains the local list) to the
process it is monitoring. This process, upon reception of that list, updates its
global list and further propagates it. Note that, since we use the ring
arrangement of processes, each process periodically sends and receives at most
one message, and the total number of messages exchanged is $O(n)$ in the worst
case, which eventually becomes $O(C)$. Furthermore, instead of using specific
messages to send the global lists, we can piggyback the global lists in the
ARE-YOU-ALIVE? messages inherent to the monitoring action. This way, there is no
increment in message exchanges from the previous algorithms. Fig. 5 presents the
extended algorithm in detail.

Every process p executes:

    {if the algorithm needs it: initial_cand_p ← pre-agreed process}
    target_p ← succ(p)
    L_p ← ∅
    G_p ← ∅
    ∀q ∈ Π : Δ_{p,q} ← default timeout

    cobegin
      || Task 1:
        loop
          wait(mutex_p)
          send ARE-YOU-ALIVE? (with G_p) to target_p
          t_out ← Δ_{p,target_p}
          received ← false
          signal(mutex_p)
          delay t_out
          wait(mutex_p)
          if not received
            {Update Δ_{p,target_p} if required}
            G_p ← G_p ∪ {target_p}
            L_p ← L_p ∪ {target_p}
            target_p ← succ(target_p)
          end if
          signal(mutex_p)
        end loop

      || Task 2:
        loop
          receive message m from a process q
          wait(mutex_p)
          case
            m = ARE-YOU-ALIVE? (with G_q):
              send I-AM-ALIVE to q
              if q ∈ L_p
                L_p ← L_p − {q, ..., pred(target_p)}
                target_p ← q
                received ← true
              end if
              G_p ← G_q ∪ L_p − {p, q}
            m = I-AM-ALIVE:
              case
                q = target_p:
                  received ← true
                q ∈ L_p:
                  L_p ← L_p − {q, ..., pred(target_p)}
                  G_p ← G_p − {q}
                  target_p ← q
                  received ← true
                else discard m
              end case
          end case
          signal(mutex_p)
        end loop
    coend

Fig. 5. Extension to the previous algorithms to provide strong completeness.

We now show that strong completeness holds, while accuracy is preserved, with
this algorithm.
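
The global-list update performed on receipt of an ARE-YOU-ALIVE? message can be illustrated with a short Python sketch. The function name merge_on_poll is hypothetical, and the grouping $(G_q \cup L_p) - \{p, q\}$ is an assumed reading of the update rule in Fig. 5, chosen so that p never places itself or the sender q in $G_p$. The loop replays one round of polls in a ring of five processes where process 3 has crashed and only process 2 suspects it locally.

```python
from typing import Dict, Set


def merge_on_poll(p: int, q: int, g_q: Set[int], l_p: Set[int]) -> Set[int]:
    """New G_p after p receives ARE-YOU-ALIVE? from q carrying G_q."""
    return (set(g_q) | set(l_p)) - {p, q}


# Process 3 has crashed; process 2 already suspects it and now monitors 4.
local: Dict[int, Set[int]] = {1: set(), 2: {3}, 4: set(), 5: set()}
global_lists: Dict[int, Set[int]] = {p: set(l) for p, l in local.items()}

# One round of polls (sender, receiver) along the ring of live processes.
for sender, receiver in [(2, 4), (4, 5), (5, 1), (1, 2)]:
    global_lists[receiver] = merge_on_poll(
        receiver, sender, global_lists[sender], local[receiver]
    )
```

After this single round every live process has process 3 in its global list, although only process 2 ever monitored it, and no extra messages were needed beyond the polls themselves.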

Observation 2 The only difference between this algorithm and the previous
ones is the handling of the global lists of suspected processes $G_p$, while the
local lists $L_p$ are handled as before. Hence, Theorem 1 and whichever of
Theorem 2 and Theorem 3 corresponds are still applicable to this algorithm.

Observation 3 $\forall p \in \Pi, \forall t, L_p(t) \subseteq G_p(t)$.



Observation 4 $\forall p \in \mathit{correct}, \forall t$, p will eventually
receive ARE-YOU-ALIVE? messages after t.

Lemma 5. $\exists t_3 : \forall q \in \mathit{crashed}, \forall p \in
\mathit{correct}, \forall t > t_3, q \in G_p(t)$.

Proof. Let us assume we are at least at instant $t_0$ as defined in Theorem 1.
We know that at this instant any process $q \in \mathit{crashed}$ has already
failed and has been permanently included in $L_{\mathit{corr\_pred}(q)}$.

Let us assume now that we have a process $q \in \mathit{crashed}$ and a process
$p \in \mathit{correct}$. We claim that q will eventually be permanently
included in $G_p$. We use strong induction on the number of correct processes in
the set $\{\mathit{corr\_pred}(q), \ldots, p\}$. For the base case we assume
there is only one correct process in the set, i.e., p = corr_pred(q). Hence,
from Theorem 1, q is permanently in $L_p$ and, from Observation 3, q will be
permanently in $G_p$ in this case.

We will now prove that, if the claim holds for any number c of correct
processes in the set $\{\mathit{corr\_pred}(q), \ldots, p\}$ with
$1 \le c \le i-1$, it also holds when the number of correct processes in the
set is i. To do so, we show first that there is a time t' after which p receives
ARE-YOU-ALIVE? messages and all of them carry global lists containing q. From
that, it is immediate to see in the algorithm that, after receiving the first
such ARE-YOU-ALIVE? message, q will be permanently included in $G_p$. Let us
assume that the number of correct processes in the set
$\{\mathit{corr\_pred}(q), \ldots, p\}$ is $i > 1$. By the induction hypothesis,
there is a time t'' at which any correct process
$r \in \{\mathit{corr\_pred}(q), \ldots, \mathit{corr\_pred}(p)\}$ permanently
contains q in its global list $G_r$. Also, there is a time
$t' = \max(t'', T_s) + \Delta_{msg}$ at which all the ARE-YOU-ALIVE? messages
sent to p before t'' have been received. From Observation 4, process p will
receive new ARE-YOU-ALIVE? messages after t'. Consider an ARE-YOU-ALIVE? message
received by p from a process s at a time t > t'. There are two cases to
consider:

- $s \in \{\mathit{corr\_pred}(q), \ldots, \mathit{corr\_pred}(p)\}$. In this
  case, from the induction hypothesis and the definition of t', we know that
  the global list $G_s$ carried by the ARE-YOU-ALIVE? message contains q.

- $s \in \{p, \ldots, \mathit{corr\_pred}(\mathit{corr\_pred}(q))\}$. In this
  case, it can be seen from the algorithm that if p receives an ARE-YOU-ALIVE?
  message from s, then necessarily, at the time of sending the message,
  $p = target_s$ and $L_s$ contained q. Therefore, from Observation 3, the
  $G_s$ carried by the ARE-YOU-ALIVE? message contains q.

The following lemma states that the algorithm of Fig. 5 preserves eventual
accuracy.

Lemma 6. Let p be any correct process. If there is a time after which no correct
process q contains p in $L_q$, then there is a time after which no correct
process q contains p in $G_q$.

Proof. Let us assume we are at least at instant $t_0$ as defined in Theorem 1.
We know that at this instant any process in crashed has already failed. Let p be
a correct process and $t''' > t_0$ be an instant such that
$\forall t > t''', \forall q \in \mathit{correct}, p \notin L_q$.

Let us assume now that we have a process $q \in \mathit{correct}$. We claim
that there is a time after which p is never in $G_q$. We use strong induction on
the number of correct processes in the set $\{p, \ldots, q\}$. For the base
case, we assume there is only one correct process in the set, i.e., p = q. It is
easy to observe from the algorithm that p will never include itself in $G_p$.

We will now prove that, if the claim holds for any number c of correct
processes in the set $\{p, \ldots, q\}$ with $1 \le c \le i-1$, it also holds
when the number of correct processes in the set is i. To do so, we show first
that there is a time t' after which q receives ARE-YOU-ALIVE? messages and all
of them carry global lists not containing p. From that, it is immediate to see
in the algorithm that, after receiving the first such ARE-YOU-ALIVE? message, p
will be removed (if needed) and never included again in $G_q$. Let us assume
that the number of correct processes in the set $\{p, \ldots, q\}$ is $i > 1$.
By the induction hypothesis, there is a time t'' after which any correct process
$r \in \{p, \ldots, \mathit{corr\_pred}(q)\}$ does not contain p in its global
list $G_r$. Also, there is a time $t' = \max(t'', T_s) + \Delta_{msg}$ at which
all the ARE-YOU-ALIVE? messages sent to q before t'' have been received. From
Observation 4, process q will receive new ARE-YOU-ALIVE? messages after t'.
Consider an ARE-YOU-ALIVE? message received by q from a process s at a time
t > t'. There are two cases to consider:

- $s \in \{p, \ldots, \mathit{corr\_pred}(q)\}$. In this case, from the
  induction hypothesis and the definition of t', we know that the global list
  $G_s$ carried by the ARE-YOU-ALIVE? message does not contain p.

- $s \in \{q, \ldots, \mathit{corr\_pred}(p)\}$. This case cannot happen,
  because it would imply that p is in the local list $L_s$.

Combining both lemmas it is immediate to derive the following theorem. 

Theorem 4. The algorithm of Fig. 5 provides strong completeness while
preserving accuracy.

Corollary 6. The algorithm of Fig. 5, combined with the algorithm of Fig. 3 or
Fig. 4, implements failure detectors of classes $\Diamond S$ and $\Diamond P$,
respectively.

7 Performance Analysis 

In this section, we will evaluate the performance of the presented algorithms in 
terms of the number and size of the exchanged messages. Observe that failure 
detection is an on-going activity that inherently requires an infinite number 
of messages. Furthermore, the pattern of message exchange between processes 
can vary over time (and need not be periodic), and different algorithms can 
have completely different patterns. For these reasons, we have to make some 
assumptions in order to use the number of messages as a meaningful performance 
measure. We will first assume that the algorithms execute in a periodic fashion, 
so that we can count the number of messages exchanged in a period. Secondly, to 
be able to compare the number of messages exchanged by different algorithms, 
we must assume that their respective periods have the same length. 






Mikel Larrea et al. 



Under the above assumptions, in our algorithms each correct process period- 
ically polls only one other process. Each polling involves two messages. Thus, a 
total of no more than 2n messages would be periodically exchanged. Eventually, 
this amount becomes 2C, since there will be only C correct processes remaining 
in the system. This compares favorably with Chandra and Toueg’s algorithm, 
which requires a periodic exchange of at least nC messages. 

Concerning the size of the messages, our algorithms implementing failure
detectors with weak completeness ($\Diamond W$ and $\Diamond Q$) require
messages of $O(\log n)$ bits (to identify the sender). On the other hand, the
algorithms implementing failure detectors with strong completeness
($\Diamond S$ and $\Diamond P$) require messages of $O(n)$ bits, since we can
code the global list $G_p$ of suspected processes in n bits (one bit per
process).

Chandra and Toueg’s algorithm, which only implements $\Diamond P$, requires
messages of $O(\log n)$ bits. This size is smaller than the size needed by our
$\Diamond P$ algorithm. However, the total amount of information periodically
exchanged in our algorithm is $O(n^2)$ bits, while in theirs it is
$O(n^2 \log n)$ bits. Furthermore, each message that is sent involves a fixed
overhead. In this sense, our algorithm has an edge, since it involves fewer
messages.
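
To give a feel for these figures, the following Python sketch evaluates the per-period counts stated in this section for concrete values of n and C. It is an illustration only: the function names are hypothetical, and the bit estimates simply multiply the message counts by the message sizes quoted in the text (n bits for a poll carrying a global list, about $\log_2 n$ bits for a sender identifier), ignoring constants and fixed per-message overhead.

```python
import math


def ring_messages(n: int, c: int = None) -> int:
    """Polls plus replies per period: at most 2n, and 2C once only the
    C correct processes remain."""
    return 2 * (n if c is None else c)


def broadcast_messages(n: int, c: int) -> int:
    """Lower bound nC quoted for the broadcast-based algorithm."""
    return n * c


def ring_bits(n: int) -> int:
    """Rough estimate: n polls of n bits plus n replies of ~log2(n) bits."""
    return n * n + n * math.ceil(math.log2(n))


def broadcast_bits(n: int, c: int) -> int:
    """Rough estimate: nC messages of ~log2(n) bits each."""
    return n * c * math.ceil(math.log2(n))
```

With n = 100 and C = 80, the ring-based algorithms exchange at most 200 messages per period (160 eventually), against at least 8000 for the broadcast-based one.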

8 Conclusions and Future Work 

In this paper we have proposed several algorithms to implement failure detectors
of classes $\Diamond W$, $\Diamond Q$, $\Diamond S$ and $\Diamond P$. These
algorithms are efficient alternatives to the algorithm implementing $\Diamond P$
proposed by Chandra and Toueg [2].

Apparently, the time to propagate the failure information in our algorithms 
is larger than the time in Chandra and Toueg’s algorithm. We need to further 
study this performance parameter both theoretically and empirically. 

As pointed out in the previous section, the number of messages periodically
exchanged is not a general enough performance measure, since algorithms need
not be periodic. We are studying new ways of evaluating the performance of this
kind of algorithm.

Abstract. This paper addresses the Consensus problem in asynchronous
distributed systems (made of n processes, at most f of them may crash)
equipped with unreliable failure detectors. A generic Consensus protocol is
presented: it is quorum-based and works with any failure detector belonging to
the class $S$ (provided that $f \le n-1$) or to the class $\Diamond S$
(provided that $f < n/2$). This quorum-based generic approach for solving the
Consensus problem is new (to our knowledge). Moreover, the proposed protocol is
conceptually simple, allows early decision and uses messages shorter than
previous solutions.

The generic dimension and the surprising design simplicity of the proposed
protocol provide a better understanding of the basic algorithmic structures and
principles that allow the Consensus problem to be solved with the help of
unreliable failure detectors.

Keywords: Asynchronous Distributed System, Consensus, Crash Failure,
Perpetual/Eventual Accuracy, Quorum, Unreliable Failure Detector.



1 Introduction 

The Consensus problem is now recognized as being one of the most important
problems to solve when one has to design or to implement reliable applications on
top of an unreliable asynchronous distributed system. Informally, the Consensus
problem is defined in the following way. Each process proposes a value, and all
non-crashed processes have to agree on a common value which has to be one of
the proposed values. The Consensus problem is actually a fundamental problem.
This is because the most important practical agreement problems (e.g., Atomic
Broadcast, Atomic Multicast, Weak Atomic Commitment) can be reduced to it
(see, for example, [2], [5] and [6] for each of the previous problems, respectively).
The Consensus problem can be seen as their greatest common denominator.

Solving the Consensus problem in an asynchronous distributed system where
processes can crash is far from being a trivial task. More precisely, it has been
shown by Fischer, Lynch and Paterson [4] that there is no deterministic solution
to the Consensus problem in those systems as soon as processes (even only one)
may crash. The intuition that underlies this impossibility result lies in the
inherent difficulty of safely distinguishing a crashed process from a “slow”
process, or from a process with which communications are “very slow”. This
result has challenged and motivated researchers to find a set of minimal
properties that, when satisfied by the runs of a distributed system, allow
Consensus to be solved despite process crashes.

The major advance to circumvent the previous impossibility result is due to
Chandra and Toueg, who have introduced [2] (and studied with Hadzilacos [3])
the Unreliable Failure Detector concept. A failure detector can be seen as a set
of modules, each associated with a process. The failure detector module attached
to a process provides it with a list of processes it suspects to have crashed. A
failure detector module can make mistakes by not suspecting a crashed process or
by erroneously suspecting a correct process. In their seminal paper [2] Chandra
and Toueg have introduced several classes of failure detectors. A class is
defined by a Completeness property and an Accuracy property. A completeness
property concerns the actual detection of crashes. The aim of an accuracy
property is to restrict erroneous suspicions. Moreover, an accuracy property is
Perpetual if it has to be permanently satisfied. It is Eventual if it is allowed
to be permanently satisfied only after some time.

In this paper, we are interested in solving the Consensus problem in
asynchronous distributed systems equipped with a failure detector of the class
$S$ or with a failure detector of the class $\Diamond S$. Both classes are
characterized by the same completeness property, namely, “Eventually, every
crashed process is suspected by every correct process”. They are also
characterized by the same basic accuracy property, namely, “There is a correct
process that is never suspected”. But these two classes differ in the way
(modality) their failure detectors satisfy this basic accuracy property. More
precisely, the failure detectors of $S$ perpetually satisfy the basic accuracy
property, while the failure detectors of $\Diamond S$ are allowed to satisfy it
only eventually.

Several Consensus protocols based on such failure detectors have been
designed. Chandra and Toueg have proposed a Consensus protocol that works
with any failure detector of the class $S$ [2]. This protocol tolerates any
number of process crashes. Several authors have proposed Consensus protocols
based on failure detectors of the class $\Diamond S$: Chandra and Toueg [2],
Schiper [9] and Hurfin and Raynal [8]. All these $\Diamond S$-based Consensus
protocols require a majority of correct processes. It has been shown that this
requirement is necessary [2]. So, these protocols are optimal with respect to
the number of crashes they tolerate. Moreover, when we consider the classes of
failure detectors that allow the Consensus problem to be solved, it has been
shown that $\Diamond S$ is the weakest one [3].

S-based Consensus protocols and ◇S-based Consensus protocols are usually
considered as defining two distinct families of failure detector-based Consensus
protocols. This is motivated by (1) the fact that one family assumes perpetual
accuracy while the other assumes only eventual accuracy, and (2) the fact that
the protocols of each family (and even protocols of the same family) are based
on different algorithmic principles. In this paper, we present a generic failure
detector-based Consensus protocol which has several interesting characteristics.
The most important one is of course its “generic” dimension: it works with any
failure detector of the class S (provided f < n) or with any failure detector of
the class ◇S (provided f < n/2), where n and f denote the total number of
processes and the maximum number of processes that may crash, respectively.
This protocol is based on a single algorithmic principle, whatever the class
of the underlying failure detector. Such a generic approach for solving the Consensus
problem is new (to our knowledge). It has several advantages. It favors
a better understanding of the basic algorithmic structures and principles that
are needed to solve the Consensus problem with the help of a failure detector. It
also provides a better insight into the “perpetual/eventual” attribute of the accuracy
property, when using unreliable failure detectors to solve the Consensus
problem. (So, it allows a single proof to be provided, in which the use of this attribute
is clearly identified.) Moreover, the algorithmic unity of the protocol is not obtained
to the detriment of its efficiency. Last but not least, the design simplicity
of the protocol is also one of its noteworthy properties.

The paper is composed of seven sections. Section 2 presents the distributed 
system model and the failure detector concept. Section 3 defines the Consensus 
problem. The next two sections are devoted to the generic protocol: it is pre- 
sented in Section 4 and proved in Section 5. Section 6 discusses the protocol and 
compares it with previous failure detector-based Consensus protocols. Finally 
Section 7 concludes the paper. 

2 Asynchronous Distributed System Model 

The system model is patterned after the one described in [2,4]. A formal intro- 
duction to failure detectors is provided in [2,3]. 

2.1 Asynchronous Distributed System with Process Crash Failures 

We consider a system consisting of a finite set Π of n > 1 processes, namely,
Π = {p_1, p_2, ..., p_n}. A process can fail by crashing, i.e., by prematurely halting.
It behaves correctly (i.e., according to its specification) until it (possibly)
crashes. By definition, a correct process is a process that does not crash. Let
f denote the maximum number of processes that can crash (f ≤ n − 1). Processes
communicate and synchronize by sending and receiving messages through
channels. Every pair of processes is connected by a channel. Channels are not
required to be FIFO, and they may also duplicate messages. They are only assumed
to be reliable in the following sense: they do not create, alter or lose messages.
This means that a message sent by a process p_i to a process p_j is assumed to
be eventually received by p_j, if p_j is correct^1. The multiplicity of processes and
the message-passing communication make the system distributed. There is no
assumption about the relative speed of processes or the message transfer delays.
This absence of timing assumptions makes the distributed system asynchronous.

^1 The “no message loss” assumption is required to ensure the Termination property
of the protocol. The “no creation and no alteration” assumptions are required to
ensure its Validity and Agreement properties (see Sections 3 and 4).

2.2 Unreliable Failure Detectors 

Informally, a failure detector consists of a set of modules, each one attached
to a process: the module attached to p_i maintains a set (named suspected_i) of
processes it currently suspects to have crashed. Any failure detector module is
inherently unreliable: it can make mistakes by not suspecting a crashed process or
by erroneously suspecting a correct one. Moreover, suspicions are not necessarily
stable: a process p_j can be added to or removed from a set suspected_i according
to whether p_i’s failure detector module currently suspects p_j or not. As in papers
devoted to failure detectors, we say “process p_i suspects process p_j” at some time,
if at that time we have p_j ∈ suspected_i.

As indicated in the introduction, a failure detector class is formally defined 
by two abstract properties, namely a Completeness property and an Accuracy 
property. In this paper, we consider the following completeness property [2]: 

— Strong Completeness: Eventually, every process that crashes is permanently 
suspected by every correct process. 

Among the accuracy properties defined by Chandra and Toueg [2] we consider 
here the two following ones: 

— Perpetual Weak Accuracy: Some correct process is never suspected. 

— Eventual Weak Accuracy: There is a time after which some correct process is 
never suspected by correct processes. 

Combined with the completeness property, these accuracy properties define 
the following two classes of failure detectors [2] : 

— S: The class of Strong failure detectors. This class contains all the failure
detectors that satisfy the strong completeness property and the perpetual
weak accuracy property.

— ◇S: The class of Eventually Strong failure detectors. This class contains all
the failure detectors that satisfy the strong completeness property and the
eventual weak accuracy property.

Clearly, S ⊂ ◇S. Moreover, it is important to note that any failure detector that
belongs to S or to ◇S can make an arbitrary number of mistakes.
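As an illustration only, the following Python sketch checks these properties on a finite recorded trace of suspicion sets. The trace representation, the function names, and the reading of “eventually” as “from some step until the end of the trace” are assumptions made for this sketch; the properties themselves are defined on infinite executions. For simplicity the accuracy check looks at every module recorded in the trace, which is slightly more demanding than eventual weak accuracy (which only constrains correct processes).

```python
from typing import Dict, List, Set

# trace[t][p] is the set suspected_p at step t (hypothetical representation).
Trace = List[Dict[int, Set[int]]]


def strong_completeness(trace: Trace, correct: Set[int], crashed: Set[int]) -> bool:
    """From some step on, every correct process suspects every crashed process."""
    def holds(t: int) -> bool:
        return all(crashed <= trace[t][p] for p in correct)
    return any(all(holds(u) for u in range(t, len(trace)))
               for t in range(len(trace)))


def weak_accuracy_from(trace: Trace, correct: Set[int], start: int) -> bool:
    """Some correct process is suspected by no recorded module from `start` on."""
    return any(all(q not in suspected
                   for t in range(start, len(trace))
                   for suspected in trace[t].values())
               for q in correct)


def in_class_S(trace: Trace, correct: Set[int], crashed: Set[int]) -> bool:
    # Perpetual weak accuracy: the accuracy property holds from step 0.
    return (strong_completeness(trace, correct, crashed)
            and weak_accuracy_from(trace, correct, 0))


def in_class_diamond_S(trace: Trace, correct: Set[int], crashed: Set[int]) -> bool:
    # Eventual weak accuracy: the accuracy property holds from some step on.
    return (strong_completeness(trace, correct, crashed)
            and any(weak_accuracy_from(trace, correct, t)
                    for t in range(len(trace))))


# Process 3 has crashed; at step 0 the two correct processes suspect each other.
trace = [
    {1: {2}, 2: {1}},
    {1: {3}, 2: {3}},
    {1: {3}, 2: {3}},
]
```

On this trace the detector behaves like a member of ◇S but not of S, since every correct process is suspected at step 0.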

3 The Consensus Problem 

3.1 Definition 

In the Consensus problem, every correct process p_i proposes a value v_i and all
correct processes have to decide on some value v, in relation to the set of proposed
values. More precisely, the Consensus problem is defined by the following three
properties [2,4]:

— Termination: Every correct process eventually decides on some value.

— Validity: If a process decides v, then v was proposed by some process.

— Agreement: No two correct processes decide differently. 

The agreement property applies only to correct processes. So, it is possible that 
a process decides on a distinct value just before crashing. Uniform Consensus 
prevents such a possibility. It has the same Termination and Validity properties 
plus the following agreement property: 

— Uniform Agreement: No two processes (correct or not) decide differently. 

In the following we are interested in the Uniform Consensus problem. 
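The difference between Agreement and Uniform Agreement can be made concrete with a small checker over the outcome of one finished execution. This is a sketch under assumed names (`check_uniform_consensus`, and dictionaries mapping process ids to values); it is not part of the protocol itself.

```python
from typing import Dict, Hashable, Set


def check_uniform_consensus(proposals: Dict[int, Hashable],
                            decisions: Dict[int, Hashable],
                            correct: Set[int]) -> Dict[str, bool]:
    """Evaluate the Consensus properties on one finished execution.

    proposals: value proposed by each process.
    decisions: value decided by each process that decided (correct or not).
    correct:   ids of the processes that never crashed.
    """
    decided_values = set(decisions.values())
    correct_values = {v for p, v in decisions.items() if p in correct}
    return {
        "termination": correct <= set(decisions),
        "validity": decided_values <= set(proposals.values()),
        "agreement": len(correct_values) <= 1,
        "uniform_agreement": len(decided_values) <= 1,
    }


# Process 1 decides 'a' just before crashing; the correct processes decide 'b'.
outcome = check_uniform_consensus(
    proposals={1: "a", 2: "b", 3: "c"},
    decisions={1: "a", 2: "b", 3: "b"},
    correct={2, 3},
)
```

In this example Agreement holds while Uniform Agreement does not, which is exactly the situation Uniform Consensus rules out.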

3.2 Solving Consensus with Unreliable Failure Detectors 

The following important results are associated with the Consensus problem when
one has to solve it in an asynchronous distributed system, prone to process crash
failures, equipped with an unreliable failure detector.

— In any distributed system equipped with a failure detector of the class S,
the Consensus problem can be solved whatever the number of crashes is [2].

— In any distributed system equipped with a failure detector of the class ◇S,
at least a majority of processes has to be correct (i.e., f < n/2) for the
Consensus problem to be solvable [2].

— Among the classes of failure detectors that allow the Consensus problem
to be solved, ◇S is the weakest one [3]. This means that, as far as
failure detection is concerned, the properties defined by ◇S constitute the
borderline beyond which the Consensus problem cannot be solved^2.

— Any protocol solving the Consensus problem using an unreliable failure detector
of the class S or ◇S also solves the Uniform Consensus problem [6].

4 The General Consensus Protocol 

4.1 Underlying Principles 

The algorithmic principles that underlie the protocol are relatively simple. The
protocol shares some of them with other Consensus protocols [2,8,9]. Each process
p_i manages a local variable est_i which contains its current estimate of the
decision value. Initially, est_i is set to v_i, the value proposed by p_i. Processes
proceed in consecutive asynchronous rounds. Each round r (initially, for each
process p_i, r_i = 0) is managed by a predetermined process p_c (e.g., c can be defined
according to the round robin order). So, the protocol uses the well-known
rotating coordinator paradigm^3.

^2 The “weakest class” proof is actually on the class ◇W of failure detectors [3]. But, it
has been shown that ◇W and ◇S, which differ in the statement of their completeness
property, are actually equivalent: the protocol that transforms any failure detector
of the class ◇W into a failure detector of the class ◇S is based on a simple gossiping
mechanism [2].

^3 Due to the completeness property of the underlying failure detector, this paradigm
can be used without compromising the protocol termination. More precisely, the
completeness property can be exploited by a process to not indefinitely wait for a
message from a crashed coordinator.

Description of a round. A round (r) is basically made of two phases (communication
steps).

First phase of a round. The current round coordinator p_c sends its current estimate
(est_c) to all processes. This phase terminates, for each process p_i, when p_i
has received an estimate from p_c or when it suspects p_c. In addition to est_i, p_i
manages a local variable est_from_c_i that contains either the value it has received
from p_c, or the default value ⊥. So, est_from_c_i = ⊥ means that p_i
has suspected p_c, and est_from_c_i ≠ ⊥ means that est_from_c_i = est_c. If we
assumed that all non-crashed processes or none of them have received p_c’s estimate
and they all have the same perception of crashes, then they would get
the same value in their est_from_c_i local variables. Consequently, they could all
“synchronously” either decide (when est_from_c_i ≠ ⊥) or proceed to the next
round (when est_from_c_i = ⊥).

Second phase of a round. Unfortunately, due to asynchrony and erroneous failure
suspicions, some processes p_j can have est_from_c_j = est_c, while other
processes p_k can have est_from_c_k = ⊥ at the end of the first phase. Actually,
the aim of the first phase was to ensure that ∀ p_i: est_from_c_i = est_c or
⊥. The aim of the second phase is to ensure that the Agreement property will
never be violated. This prevention is done in the following way: if a process p_i
decides v = est_c during r and if a process p_j progresses to r + 1, then p_j does
it with est_j = v. This is implemented by the second phase that requires each
process p_i to broadcast the value of its est_from_c_i local variable. A process p_i
finishes the second phase when it has received est_from_c values from “enough”
processes. The meaning of “enough” is captured by a set Q_i, dynamically defined
during each round. Let rec_i be the set of est_from_c values received by p_i
from the processes of Q_i. We have: rec_i = {⊥} or {v} or {v, ⊥} (where v is
the estimate of the current coordinator). Let p_j be another process (with its Q_j
and rec_j). If Q_i ∩ Q_j ≠ ∅, then there is a process p_x ∈ Q_i ∩ Q_j that has broadcast
est_from_c_x and both p_i and p_j have received it. It follows that rec_i and rec_j
are related in the following way:

rec_i = {v} ⇒ (∀ p_j : (rec_j = {v}) ∨ (rec_j = {v, ⊥}))

rec_i = {⊥} ⇒ (∀ p_j : (rec_j = {⊥}) ∨ (rec_j = {v, ⊥}))

rec_i = {v, ⊥} ⇒ (∀ p_j : (rec_j = {v}) ∨ (rec_j = {⊥}) ∨ (rec_j = {v, ⊥}))

The behavior of p_i is then determined by the content of rec_i:

— When rec_i = {v}, p_i knows that all non-crashed processes also know v.
So, p_i is allowed to decide on v provided that all processes that do not
decide consider v as their current estimate.

— When rec_i = {⊥}, p_i knows that any set rec_j includes ⊥. In that case, no
process p_j is allowed to decide and p_i proceeds to the next round.

— When rec_i = {v, ⊥}, according to the previous items, p_i updates its current
estimate (est_i) to v to achieve the Agreement property. Note that if a process
p_j decides during this round, any process p_i that proceeds to the next
round does it with est_i = v.
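The three cases above, and the way intersecting quorums protect Agreement, can be sketched in Python as follows. The names `BOT`, `end_of_round` and `safe_for_all_intersecting_quorums` are hypothetical, and the brute-force check over a four-process system is only an illustration of the argument, not a proof.

```python
from itertools import combinations, product

BOT = None  # stands for the default value ⊥


def end_of_round(rec, est):
    """Return (decided value or None, estimate carried to the next round)."""
    values = rec - {BOT}
    if not values:            # rec = {⊥}: keep the estimate, go to the next round
        return None, est
    (v,) = values             # the only non-⊥ value is the coordinator's estimate
    if BOT in rec:            # rec = {v, ⊥}: adopt v, go to the next round
        return None, v
    return v, v               # rec = {v}: decide v


def safe_for_all_intersecting_quorums(n=4, v="v"):
    """Check: whenever some process decides v, every other process ends with v."""
    quorums = [set(q) for k in range(1, n + 1)
               for q in combinations(range(n), k)]
    for sent in product([v, BOT], repeat=n):      # est_from_c broadcast by each process
        for qi, qj in product(quorums, repeat=2):
            if not (qi & qj):
                continue                          # only intersecting quorums are allowed
            rec_i = {sent[x] for x in qi}
            rec_j = {sent[x] for x in qj}
            decided, _ = end_of_round(rec_i, "old_i")
            _, next_est_j = end_of_round(rec_j, "old_j")
            if decided is not None and next_est_j != decided:
                return False
    return True
```

The check succeeds because a process in Q_i ∩ Q_j sent the same value to both p_i and p_j; dropping the intersection test would let one process decide v while another keeps a different estimate.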

Definition of the Q_i set. As indicated previously, the definition of the Q_i sets has
to ensure that the predicate Q_i ∩ Q_j ≠ ∅ holds for every pair (p_i, p_j). The way
this condition is realized depends on the class to which the underlying failure
detector belongs.

Let us first consider the case where the failure detector belongs to the class S.
In that case, there is a correct process that is never suspected. Let p_x be this
process. If Q_i contains p_x, then p_i will obtain the value of est_from_c_x. It follows
that if (∀ p_i) Q_i is such that Π = Q_i ∪ suspected_i, then, ∀ (p_i, p_j), we have p_x ∈
Q_i ∩ Q_j.

Let us now consider the case where the failure detector belongs to the
class ◇S. In that case, f < n/2 and there is a time after which some correct
process is no longer suspected. As we do not know the time from which
a correct process is no longer suspected, we can only rely on the majority of
correct processes assumption. So, by taking (∀ p_i) Q_i equal to a majority set, it
follows that, ∀ (p_i, p_j), ∃ p_x such that p_x ∈ Q_i ∩ Q_j.

Note that in both cases, Q_i is not statically defined. In each round, its actual
value depends on message receptions and additionally, in the case of S, on process
suspicions.
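The two quorum conditions can be written as simple predicates. The following Python sketch uses hypothetical function names and represents processes by integers; the brute-force helper only illustrates why majority sets always intersect.

```python
import math
from itertools import combinations


def quorum_ok_S(quorum, suspected, processes):
    """Class S: Q_i is complete once Q_i ∪ suspected_i = Π."""
    return set(quorum) | set(suspected) == set(processes)


def quorum_ok_diamond_S(quorum, n):
    """Class ◇S: Q_i is a majority set, |Q_i| = ⌈(n + 1)/2⌉."""
    return len(set(quorum)) == math.ceil((n + 1) / 2)


def majorities_intersect(n):
    """Brute-force check that any two majority sets share at least one process."""
    size = math.ceil((n + 1) / 2)
    majorities = [set(q) for q in combinations(range(1, n + 1), size)]
    return all(a & b for a in majorities for b in majorities)
```

For S the intersection is not guaranteed by the predicate alone: it relies on the never-suspected correct process p_x, which can never be placed in suspected_i and therefore belongs to every Q_i.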

On the quorum-based approach. The previous principles actually define a quorum-based
approach. As usual, (1) each quorum must be live: it must include only
non-crashed processes (this ensures processes will not block forever during a
round). Furthermore, (2) each quorum must be safe: it must have a non-empty
intersection with any other quorum (this ensures the agreement property cannot
be violated). As indicated in the previous paragraph, the quorum safety requirement
is guaranteed by the “perpetual” modality of the accuracy property (for S),
and by the majority of correct processes assumption (for ◇S).

Other combinations of eventual weak accuracy (to guarantee eventual termination)
and live and safe (possibly non-majority) quorums would work^4. Bringing
a quorum-based formulation to the fore is conceptually interesting. Indeed, the
protocol presented in the next section works for any failure detector satisfying
strong completeness, eventual weak accuracy and the “quorum” conditions.
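As an illustration of a safe non-majority quorum system, the sketch below builds quorums on a q × q grid with n = q^2 processes. The specific construction (one full row plus one full column) is an assumption chosen for this example; the text only refers to [1] for quorum definitions. The function names and the row-by-row numbering of processes are likewise hypothetical.

```python
from itertools import product


def grid_quorum(q, row, col):
    """Processes 0..q*q-1 laid out row by row; quorum = full row `row` + full column `col`."""
    return {row * q + c for c in range(q)} | {r * q + col for r in range(q)}


def all_grid_quorums(q):
    return [grid_quorum(q, r, c) for r, c in product(range(q), repeat=2)]


q = 3
quorums = all_grid_quorums(q)
sizes = {len(Q) for Q in quorums}                     # every quorum has 2q - 1 members
safe = all(a & b for a in quorums for b in quorums)   # any two quorums intersect

# Crashing one full row (q processes) hits every quorum, so none stays live.
row0 = {0, 1, 2}
row_blocks_all = all(Q & row0 for Q in quorums)
```

With q = 3 each quorum has 5 of the 9 processes, so a round can complete despite 4 crashes in the best case, while 3 well-placed crashes already block every quorum.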

4.2 The Protocol 

The protocol is described in Figure 1. A process p_i starts a Consensus execution
by invoking Consensus(v_i). It terminates it when it executes the statement
return, which provides it with the decided value (lines 12 and 16).

^4 As an example, let us consider quorums defined from a √n × √n grid (with n =
q^2). This would allow the protocol to progress despite n − (2√n − 1) crashes or
erroneous suspicions in the most favorable case. Of course, in the worst case, the use
of such quorums could block the protocol in the presence of only √n crashes or erroneous
suspicions. Details on quorum definition can be found in [1].
It is possible that distinct processes do not decide during the same round. To
prevent a process from blocking forever (i.e., waiting for a value from a process
that has already decided), a process that decides uses a reliable broadcast [7]
to disseminate its decision value (as do the protocols described in [2,8,9]).
To this end the Consensus function is made of two tasks, namely, T1 and T2.
T1 implements the previous discussion. Line 12 and T2 implement the reliable
broadcast.



Function Consensus(v_i)
cobegin
(1)  task T1: r_i ← 0; est_i ← v_i; % v_i ≠ ⊥ %
(2)  while true do
(3)    c ← (r_i mod n) + 1; est_from_c_i ← ⊥; r_i ← r_i + 1; % round r = r_i %
(4)    case (i = c) then est_from_c_i ← est_i
(5)         (i ≠ c) then wait ((EST(r_i, v) is received from p_c) ∨ (c ∈ suspected_i));
(6)                      if (EST(r_i, v) has been received) then est_from_c_i ← v
(7)    endcase; % est_from_c_i = est_c or ⊥ %
(8)    ∀j do send EST(r_i, est_from_c_i) to p_j enddo;
(9)    wait until (∀ p_j ∈ Q_i: EST(r_i, est_from_c) has been received from p_j);
         % Q_i has to be a live and safe quorum %
         % For S: Q_i is such that Q_i ∪ suspected_i = Π %
         % For ◇S: Q_i is such that |Q_i| = ⌈(n + 1)/2⌉ %
(10)   let rec_i = {est_from_c | EST(r_i, est_from_c) is received at line 5 or 9};
         % est_from_c = ⊥ or v with v = est_c %
         % rec_i = {⊥} or {v} or {v, ⊥} %
(11)   case (rec_i = {⊥}) then skip
(12)        (rec_i = {v}) then ∀j ≠ i do send DECIDE(v) to p_j enddo; return(v)
(13)        (rec_i = {v, ⊥}) then est_i ← v
(14)   endcase
(15) enddo

(16) task T2: upon reception of DECIDE(v):
         ∀j ≠ i do send DECIDE(v) to p_j enddo; return(v)
coend



Fig. 1. The Consensus Protocol 
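To make the control flow of Figure 1 easier to follow, here is a minimal lockstep simulation in Python. It is a sketch under strong simplifying assumptions that are not in the protocol itself: all processes advance round by round together, crashes happen only before round 1, a live coordinator's estimate reaches every process that does not suspect it, the ◇S quorum is simply the first majority of live processes (so f < n/2 is assumed), and a decision is delivered to everyone in the round where it is taken. The names `run_consensus`, `suspects` and `detector` are hypothetical.

```python
import math

BOT = None  # stands for the default value ⊥ (proposals must differ from it)


def run_consensus(proposals, crashed, suspects, detector="S", max_rounds=100):
    """Lockstep sketch of task T1. Processes are numbered 1..n.

    proposals: {i: v_i}; crashed: processes that crash before round 1;
    suspects(i, r): set suspected by p_i during round r (it must include
    `crashed` and never include i itself).
    Returns {i: decided value} for the live processes, or {} if undecided.
    """
    n = len(proposals)
    alive = sorted(set(proposals) - set(crashed))
    est = {i: proposals[i] for i in alive}
    majority = math.ceil((n + 1) / 2)
    for r in range(1, max_rounds + 1):
        c = ((r - 1) % n) + 1                      # line 3: rotating coordinator
        # First phase (lines 4-7): est_from_c is est_c or ⊥.
        est_from_c = {}
        for i in alive:
            if i == c:
                est_from_c[i] = est[i]
            elif c in est and c not in suspects(i, r):
                est_from_c[i] = est[c]
            else:
                est_from_c[i] = BOT
        # Second phase (lines 8-14).
        decided, next_est = None, dict(est)
        for i in alive:
            if detector == "S":                    # Q_i ∪ suspected_i = Π
                quorum = [j for j in alive if j not in suspects(i, r)]
            else:                                  # ◇S: a majority of processes
                quorum = alive[:majority]
            rec = {est_from_c[j] for j in quorum}
            if est_from_c[i] is not BOT:           # value received at line 5
                rec.add(est_from_c[i])
            if rec == {BOT}:
                continue                           # line 11: skip
            v = est[c]                             # the only non-⊥ value in rec
            if BOT in rec:
                next_est[i] = v                    # line 13: adopt v
            else:
                decided = v                        # line 12: decide v
        if decided is not None:
            # DECIDE is reliably broadcast, so every live process returns v.
            return {i: decided for i in alive}
        est = next_est
    return {}
```

For instance, with four processes, p_1 crashed and correctly suspected, round 1 yields rec_i = {⊥} everywhere and round 2 decides the estimate of p_2. With ◇S and an erroneous suspicion of p_1 in round 1, the live processes first adopt p_1's value and then decide it in round 2.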



5 Correctness Proof 

5.1 Validity 

Lemma 1. Let us consider a round r and a process p_i. The round r is coordinated
by p_c. We have:

(1) If p_c participates in round r, then est_c is equal to an initial value proposed
by a process.

(2) If p_i computes rec_i during round r, then: rec_i = {⊥} or rec_i = {v} or
rec_i = {v, ⊥}, where v is equal to est_c. Moreover, if v ∈ rec_i, p_c has participated
in round r.

(3) If p_i starts round r + 1, it does it with an estimate (est_i) whose value is equal
to an initial value.

Proof The proof is by induction on the round number.

— Base case. Let us consider the round r = 1. It is coordinated by p_c = p_1,
and est_c is equal to v_c (p_c’s proposal, line 1). The local variable est_from_c_j
of any process p_j that (during this round) executes line 8 is equal either to
est_c (if p_j has received an estimate from p_c, line 6, or if p_j = p_c, line 4) or
to ⊥ (if p_j has suspected p_c, line 5). So, if p_j executes line 8, it broadcasts
either the value of est_c (= v_c) or ⊥. It follows that any p_i that computes
rec_i during the first round can only receive v_c or ⊥ at line 9. Consequently,
we have: rec_i = {⊥} or rec_i = {v_c} or rec_i = {v_c, ⊥}.

Now, let us first note that, initially, est_i = v_i (line 1). Due to lines 11-13, if p_i
starts r + 1, it does it either with the value of est_i left unchanged (line 11)
or with est_i = v_c (line 13). So, the lemma is true for r = 1.

— Assume the lemma is true until r, r ≥ 1. This means that if p_c (the round
r + 1 coordinator) participates in r + 1, then we had (at the end of r)
rec_c = {⊥} or rec_c = {v} or rec_c = {v, ⊥}, where v is an initial value.
Due to the induction assumption and to the case statement (lines 11-14)
executed by p_c at the end of r, it follows that p_c starts r + 1 with est_c equal
to an initial value proposed by a process. Now, the situation is similar to the
one of the base case, and consequently, the same argument applies to the
round r + 1 case, which proves the lemma.

□ Lemma 1

Theorem 1. If a process p_i decides v, then v was proposed by some process.

Proof If a process decides at line 16, it decides the value decided by another
process at line 12. So we only consider the case where a value has been
decided at line 12. When a process p_i decides v at line 12, it decides on the value
(≠ ⊥) of the rec_i singleton. Due to items (1) and (2) of Lemma 1, v is an
initial value of a process. □ Theorem 1



5.2 Termination 

Lemma 2. If no process decides during r' ≤ r, then all correct processes start
r + 1.

Proof The proof is by contradiction. Suppose that no process has decided
during a round r' ≤ r, where r is the smallest round number in which a correct
process p_i blocks forever. So, p_i is blocked at line 5 or at line 9.

Let us first examine the case where p_i blocks at line 5. Let p_c be the round r
coordinator. If p_i = p_c, it cannot block at line 5, as it does not execute this line.
Moreover, in that case, it executes the broadcast at line 8 or crashes. If p_i ≠ p_c,
then:

- Either p_i suspects p_c. This is due to an erroneous suspicion or to the strong
completeness property of the failure detector.

- Or p_i never suspects p_c. Due to the strong completeness property, this means
that p_c is correct. From the previous observation, p_c has broadcast its current
estimate at line 8, and p_i eventually receives it.

It follows that p_i cannot remain blocked at line 5.

Let us now consider the case where p_i blocks at line 9. For this line there are
two cases to consider, according to the class of the underlying failure detector.

— The failure detector belongs to S and f ≤ n − 1. In that case, the set Q_i of
processes from which p_i is waiting for messages is such that Q_i ∪ suspected_i =
Π. As no correct process blocks forever at line 5, each of them executes a
broadcast at line 8. It follows from these broadcasts and from the strong
completeness property that, ∀ p_j, p_i will receive a round r estimate from p_j
or will suspect it. Consequently, p_i cannot block at line 9.

— The failure detector belongs to the class ◇S and f < n/2. In that case Q_i is
defined as the first majority set of processes p_j from which p_i has received an
EST(r, est_from_c) message. As there is a majority of correct processes, and
as (due to the previous observation) they do not block forever at line 5, they
broadcast a round r estimate message (line 8). It follows that any correct
process receives a message from a majority set of processes. Consequently, p_i
cannot block at line 9.

Finally, let us note that, due to item (2) of Lemma 1, a correct process p_i
terminates correctly the execution of the case statement (lines 11-14). It follows
that if p_i does not decide, it proceeds to the next round. A contradiction.

□ Lemma 2



Theorem 2. If a process p_i is correct, then it decides.

Proof If a (correct or not) process decides, then, due to the sending of DECIDE
messages at line 12 or at line 16, any correct process will receive such a message
and decide accordingly (line 16).

So, suppose that no process decides. The proof is by contradiction. Due to
the accuracy property of the underlying failure detector, there is a time t after
which there is a correct process that is never suspected. Note that t = 0 if
the failure detector belongs to S, and t ≥ 0 if it belongs to ◇S (assuming the
protocol starts executing at time t = 0).

Let p_x be the correct process that is never suspected after t. Moreover, let r be
the first round that starts after t and that is coordinated by p_x. As by assumption
no process decides, due to Lemma 2, all the correct processes eventually start
round r.

The process p_x starts round r by broadcasting its current estimate value
(est_x), which, due to Lemma 1, is equal to an initial value. Moreover, during r, p_x
is not suspected. Consequently, all processes p_i that participate in round r (this
set includes the correct processes) receive est_x at line 5, and adopt it as their
est_from_c_i value. It follows that no value different from est_x can be broadcast
at line 8; consequently, est_x is the only value that can be received at line 9.
Hence, for any correct process p_i we have rec_i = {est_x} at line 10. It follows
that any correct process executes line 12 and decides accordingly. □ Theorem 2

The following corollary follows from the proof of the previous theorem. 

Corollary 1. If the underlying failure detector belongs to the class S, the maximum
number of rounds is n. Moreover, there is no bound on the round number
when the underlying failure detector belongs to the class ◇S.

5.3 Uniform Agreement 

Lemma 3. If two processes p_i and p_j decide at line 12 during the same round,
they decide the same value.

Proof If both p_i and p_j decide during the same round r, at line 12, we had
rec_i = {v} and rec_j = {v'} at line 10. Moreover, from item (2) of Lemma 1, we
have v = v' = est_c (where est_c is the value broadcast during r by its coordinator).

□ Lemma 3



Theorem 3. No two processes decide differently.

Proof Let r be the first round during which a process p_i decides. It decides at
line 12. Let v be the value decided by p_i. Let us assume another process decides
v' during a round r' ≥ r. If r' = r, then due to Lemma 3, we have v = v'. So, let
us consider the situation where r' > r. We show that the estimate values (est_j)
of all the processes p_j that progress to r + 1 are equal to v. This means that no
other value can be decided in a future round^5.

Let us consider any process p_k that terminates the round r. Let us first note
that there is a process p_x such that p_x ∈ Q_i ∩ Q_k. This follows from the following
observation:

- If the failure detector belongs to S, then by considering p_x, the correct process
that is never suspected, we have p_x ∈ Q_i ∩ Q_k.

- If the failure detector belongs to the class ◇S, as Q_i and Q_k are majority sets,
we have Q_i ∩ Q_k ≠ ∅, and there is a p_x such that p_x ∈ Q_i ∩ Q_k.

As p_i has decided v at line 12 during r, we had during this round rec_i = {v}.
This means that p_i has received v from all the processes of Q_i, and so from p_x.
Thus, p_k has also received v from p_x and consequently, rec_k = {v} or rec_k =
{v, ⊥}. It follows that if p_k proceeds to the next round, it executes line 13.
Consequently, for all processes p_j that progress to r + 1, we have est_j = v. This
means that, from round r + 1, all estimate values are equal to v. As no value
different from v is present in the system, the only value that can be decided in
a round > r is v. □ Theorem 3

^5 When we consider the terminology used in ◇S-based protocols, this means the
value v is locked. This proof shows that the “value locking” principle is not bound to
the particular use of ◇S. With S, a value is locked as soon as it has been forwarded
by the (first) correct process that is never suspected. With ◇S, a value is locked as
soon as it has been forwarded by a majority of processes.



6 Discussion 

6.1 Cost of the Protocol 

Time complexity of a round. As indicated in Corollary 1, the number of rounds
of the protocol is bounded by n, when used with a failure detector of the class S.
There is no upper bound when it is used with a failure detector of the class ◇S.
So, to analyze the time complexity of the protocol, we consider the length of
the sequence of messages (number of communication steps) exchanged during a
round. Moreover, although we do not control the quality of service offered
by failure detectors, in practice failure detectors can be tuned to make mistakes
very seldom. We therefore carry out this analysis assuming that the underlying
failure detector behaves reliably. In such a context, the time complexity of
a round is characterized by a pair of integers. Considering the most favorable
scenario that allows a decision during the current round, the first integer measures
its number of communication steps. The second integer considers the case
where a decision cannot be obtained during the current round and measures the
minimal number of communication steps required to progress to the next round.
Let us consider these scenarios.

— The first scenario is when the current round coordinator is correct and is
not suspected. In that case, 2 communication steps are required to decide.
During the first step, the current coordinator broadcasts its value (line 8).
During the second step, each process forwards that value (line 8), waits
for “enough” messages (line 9), and then decides (line 12). So, in the most
favorable scenario that allows a decision during the current round, the round
is made of two communication steps.

— The second scenario is when the current round coordinator has crashed and
is suspected by all processes. In that case, as processes correctly suspect
the coordinator (line 5), they actually skip the first communication step.
They directly exchange the ⊥ value (line 8) and proceed to the next round
(line 11). So, in the most favorable scenario to proceed to the next round,
the round is made of a single communication step.

So, when the underlying failure detector behaves reliably, according to the previous
discussion, the time complexity of a round is characterized by the pair (2,1)
of communication steps.
Message complexity of a round. During each round, each process sends a message
to each process (including itself). Hence, the message complexity of a round is
upper bounded by n^2.

Message type and size. There are two types of message: EST and DECIDE. A DECIDE
message carries only a proposed value. An EST message carries a proposed
value (or the default value ⊥) plus a round number. The size of the round number
is bounded by log_2(n) when the underlying failure detector belongs to S
(Corollary 1). It is not bounded in the other case.
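As a small worked example of these bounds, the following Python sketch computes the per-round message count and the number of bits needed for the round number with S. The function names are hypothetical, and the bit count assumes the n possible round numbers 1..n are encoded as n distinct codes.

```python
def messages_per_round(n: int) -> int:
    """Each of the n processes sends one EST message to each of the n processes."""
    return n * n


def round_number_bits(n: int) -> int:
    """Bits needed to tell n distinct round numbers apart, i.e. ceil(log2(n)), at least 1."""
    return max(1, (n - 1).bit_length())
```

For n = 8 processes this gives at most 64 EST messages per round and a 3-bit round number when the failure detector belongs to S.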



6.2 Related Work 

Several failure detector-based Consensus protocols have been proposed in the
literature. We compare here the proposed protocol (in short MR) with the following
protocols:

- The S-based Consensus protocol proposed in [2] (in short, CT_S).

- The ◇S-based Consensus protocol proposed in [2] (in short, CT_◇S).

- The ◇S-based Consensus protocol proposed in [9] (in short, SC_◇S).

- The ◇S-based Consensus protocol proposed in [8] (in short, HR_◇S).

As MR, all these protocols proceed in consecutive asynchronous rounds.
Moreover, all but CT_S are based on the rotating coordinator paradigm. It
is important to note that each of these protocols has been specifically designed
for a special class of failure detectors (either S or ◇S). Differently from MR,
none of them has a generic dimension. Let us also note that only MR and both
CT protocols cope naturally with message duplications (i.e., they do not require
additional statements to discard duplicate messages).

Let V = {initial values proposed by processes} ∪ {⊥}. Table 1 compares
CT_S and MR (when used with S). Both protocols use messages during each
round. A round is made of one or two communication steps in MR, and of a single
communication step in CT_S. The first column indicates the total number (k) of
communication steps needed to reach a decision. For MR, this number depends
on the parameter f. As indicated, CT_S does not allow early decision, while MR
does. The second column indicates the size of messages used by each protocol.
As the current round number is carried by messages of both protocols, it is not
indicated.





              # communication steps    Message size
CT_S          k = n                    An array of n values ∈ V
MR with S     2 ≤ k ≤ 2(f + 1)         A single value ∈ V

Table 1. Comparing MR with CT_S
Table 2 compares MR (when used with ◇S) with CT_◇S, SC_◇S and HR_◇S.
In all cases, there is no bound on the round number and all protocols allow
early decision. So, the first column compares the time complexity of a round,
according to the previous discussion (Section 6.1). The second column is devoted
to the message size. As each protocol uses messages of different size, we only
consider their biggest messages. Moreover, as in all protocols, each of those
messages carries its identity (sender id, round number) and an estimate value,
the second column indicates only their additional fields. Let us additionally note
that, differently from SC_◇S and HR_◇S, MR does not require special statements
to prevent deadlock situations.





              Time complexity of a round    Message size
CT_◇S         (3,0)                         An integer timestamp
SC_◇S         (2,2)                         A boolean and a process id
HR_◇S         (2,1)                         A boolean
MR with ◇S    (2,1)                         No additional value

Table 2. Comparing MR with CT_◇S, SC_◇S and HR_◇S



Finally, let us note that MR provides a (factorized) proof, that is shorter and 
simpler to understand than the proofs designed for the other protocols. 

7 Conclusion 

This paper has presented a generic Consensus protocol that works with any
failure detector belonging to the class S (provided that f ≤ n − 1) or to the
class ◇S (provided that f < n/2).

The proposed protocol is conceptually simple, allows early decision and uses 
messages shorter than previous solutions. It has been compared to other Consen- 
sus protocols designed for specific classes of unreliable failure detectors. Among 
its advantages, the design simplicity of the proposed protocol has allowed the 
design of a simple (and generic) proof. The most noteworthy of its properties lie 
in its quorum-based approach and in its generic dimension. 

It is important to note that a Consensus protocol initially designed to work
with a failure detector of the class S will not work when S is replaced by ◇S.
Moreover, a Consensus protocol initially designed to work with a failure detector
of ◇S requires f < n/2; if ◇S is replaced by S, the protocol will continue to
work, but will still require f < n/2, which is not a necessary requirement in that
context. Actually, modifying a ◇S-based Consensus protocol to work with S
and f ≤ n − 1 amounts to designing a new protocol. The generic dimension of the
proposed protocol avoids this drawback. In that sense, the proposed protocol
is the first failure detector-based Consensus protocol that is not bound to a
particular class of failure detectors.

Last but not least, the design of this generic protocol is a result of our effort
to understand the relation linking S on one side, and ◇S plus the majority
requirement on the other side, when solving the Consensus problem with unreliable
failure detectors.

Abstract. Quorum-based methods for managing replicated data are
popular because they provide availability of both reads and writes in 
the presence of faulty behavior by some sites or communication links. 
Over a very long time, it may become necessary to alter the quorum 
system, perhaps because some sites have failed permanently and oth- 
ers have joined the system, or perhaps because users want a different 
trade-off between read-availability and write-availability. There are subtle
issues that arise in managing the change of quorums, including how
to make sure that any operation using the new quorum system is aware 
of all information from operations that used an old quorum system, and 
how to allow concurrent attempts to alter the quorum system. 

In this paper we use ideas from group management services, especially 
those providing a dynamic notion of primary view; with this we define 
an abstract specification of a system that presents each user with a con- 
sistent succession of identified configurations, each of which has a mem- 
bership set, and a quorum system for that set. The key contribution here 
is the intersection property, that determines how the new configurations 
must relate to previous ones. We demonstrate that our proposed specifi- 
cation is neither too strong, by showing how it can be implemented, nor 
too weak, by showing the correctness of a replicated data management 
algorithm running above it. 



1 Introduction 

In distributed applications involving replicated data, a well known way to en- 
hance the availability and efficiency of the system is to use quorums. A quo- 
rum is a subset of the members of the system, such that any two quorums 
have non-empty intersection. An update can be performed with only a quo- 
rum available, unlike other replication techniques where all of the members 
must be available. The intersection property of quorums guarantees consistency.
Quorum systems have been extensively studied and used in applica-
tions, e.g., [1,7,8,18,23,24,34,38]. The use of quorums has been proven effective 
also against Byzantine failures [32,33]. 
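
As a small illustration of the definition above, the following sketch checks
the pairwise-intersection condition for a candidate quorum system. The function
name is_quorum_system and the five-site majority example are illustrative
assumptions, not part of the paper.

```python
from itertools import combinations


def is_quorum_system(members, quorums):
    """Return True if every quorum is a nonempty subset of members
    and any two quorums have a non-empty intersection."""
    members = set(members)
    quorums = [frozenset(q) for q in quorums]
    if not quorums or any(not q or not q <= members for q in quorums):
        return False
    return all(q1 & q2 for q1, q2 in combinations(quorums, 2))


# Hypothetical example: all majorities (3 out of 5) of five sites.
sites = {"a", "b", "c", "d", "e"}
majorities = [set(c) for c in combinations(sorted(sites), 3)]
```

Two disjoint pairs such as {a, b} and {c, d} fail the check, which is why an
update performed at one of them could go unnoticed at the other.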

Pre-defined quorum sets can yield efficient implementations in settings which 
are relatively static, i.e., failures are transient. However they work less well in 
settings where processes routinely join and leave the system, or where the system 
can suffer multiple partitions. These settings require the on-going modification 
of the choice of quorums. For example, if more sites join the system, quorums 
must be reconfigured to make use of the new sites. If many sites fail permanently, 
quorums must be reconfigured to maintain availability. The most common pro- 
posal has been to use a two-phase commit protocol which stops all application 
operations while all sites are notified of the new configuration. Since two-phase 
commit is a blocking protocol, this solution is vulnerable to a single failure dur- 
ing the configuration change. In a setting of database transactions, [23] showed 
how to integrate fault-tolerant updates of replicated information about quorum 
sizes (using the same quorums for both data item replicas, and for quorum in- 
formation replicas). 

Here we offer a different approach, based on ideas of dynamic primary views
from group management systems. View-oriented group communication services
have become important as building blocks for fault-tolerant distributed systems.
Such a service enables application processes located at different nodes of a fault-
prone distributed network to operate collectively as a group, using the service to
multicast messages to all members of the group. Each such service is based on
a group membership service, which provides each group member with a view of
the group; a view includes a list of the processes that are members of the group.
Messages sent by a process in one view are delivered only to processes in the
membership of that view, and only when they have the same view. Within each
view, the service offers guarantees about the order and reliability of message
delivery. Examples of view-oriented group communication services are found
in Isis [9], Transis [15], Totem [37], Newtop [20], Relacs [3], Horus [46] and
Ensemble [45].

For many applications, some views must be distinguished as primary views.
Primary views have stronger properties, which allow updates to occur consis- 
tently. Traditionally, a primary view was defined as one containing a majority of 
all possible sites, but other, dynamic, definitions are possible, based on intersec- 
tion properties between successive primary views. One possibility is to define a 
primary view as a view containing a majority of the previous primary view. Sev- 
eral papers define primary views adaptively, e.g., [6,13,14,17,27,35,41,43,47]. Pro- 
ducing good specifications for view-oriented group communication services is dif- 
ficult, because these services can be complicated, and because different such ser- 
vices provide different guarantees about safety, performance, and fault-tolerance. 
Examples of specifications for group membership services and view-oriented 
group communication services appear in [4,5,10,12,16,21,22,25,26,36,39,42,44]. 
Extending these definitions to specify dynamic primary views was the focus of 

In this paper we combine the notion of dynamic primary view with that of
a quorum system, and call the result a configuration. We integrate this with 
a group communication service, resulting in a dynamic primary configuration 
group communication service. The main difficulty in combining quorum systems 
with the notion of dynamic primary view is the intersection property between 
quorums from different views, which is required to maintain consistency. With 
configurations the simple intersection property (i.e., a primary view contains a 
majority of the previous primary) that works for primary views, is no longer 
enough. Indeed, updated information might reside only at a quorum, and the
processors in the intersection might not be in that quorum. A stronger intersection
property is required. We propose one possible intersection property that allows 
applications to keep consistency across different primary configurations. Namely, 
we require that there be a quorum of the old primary configuration which is in- 
cluded in the membership set of the new primary configuration. This guarantees 
that there is at least one process in the new primary configuration that has the 
most up to date information. This, similarly to the intersection property of dy- 
namic primary views, allows flow of information from the old configuration to 
the new one and thus permits one to preserve consistency. 

The specific configurations we consider use two sets of quorums, a set of 
read quorums and a set of write quorums, with the property that any read 
quorum intersects any write quorum. (This choice is justified by the application 
we develop, an atomic read/write register.) With this kind of configuration the
intersection property that we require for a new primary configuration is that 
there be one read quorum and one write quorum both of which are included in 
the membership set of the new primary configuration. The use of read and write 
quorums (as opposed to just quorums) can be more efficient in order to balance 
the load of the system (see for example [18]). 
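
The requirement on a new primary configuration can be sketched as a simple
set-inclusion test. The helper names (k_subsets, rw_quorums_intersect,
new_config_allowed) and the size-based quorums below are assumptions made
only for this illustration.

```python
from itertools import combinations


def k_subsets(members, k):
    """All subsets of members with exactly k elements."""
    return [frozenset(c) for c in combinations(sorted(members), k)]


def rw_quorums_intersect(reads, writes):
    """Every read quorum must intersect every write quorum."""
    return all(r & w for r in reads for w in writes)


def new_config_allowed(old_reads, old_writes, new_members):
    """True if some read quorum and some write quorum of the old
    configuration are both included in the new membership set."""
    new_members = frozenset(new_members)
    return (any(r <= new_members for r in old_reads)
            and any(w <= new_members for w in old_writes))


# Hypothetical old configuration: 5 members, read quorums of size 2,
# write quorums of size 4 (2 + 4 > 5, so reads and writes intersect).
old_members = {1, 2, 3, 4, 5}
old_reads = k_subsets(old_members, 2)
old_writes = k_subsets(old_members, 4)
```

In this example the membership set {1, 2, 3, 6} contains a majority of the old
members but no old write quorum, so it is rejected; {1, 2, 3, 4, 6} is accepted.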

We provide a formal automaton specification, called DC for “dynamic config- 
urations”, for the safety guarantees made by a dynamic primary configuration 
group communication service. We remark that we do not address liveness proper- 
ties here, but that they can be expressed as conditional performance properties, 
similar to those in [21], or with other techniques such as failure-detectors [11]. 

Clearly the DC specification provides support for primary configurations. 
However it also has another important feature, namely, it provides support for 
state-exchange. When a new configuration starts, applications generally require 
some pre-processing, such as an exchange of information, to prepare for ordinary 
computation. Typically this is needed in order to bring every member of the con- 
figuration up to date. For example, processes in a coherent database application 
may need to exchange information about previous updates in order to bring ev- 
eryone in the new configuration up to date. We will refer to the up-to-date state 
of a new configuration as the starting state of that configuration. The starting 
state is the state of the computation that all members should have in order 
to perform regular computation. When the notification of a new configuration 
is given to its members, the DC specification allows these members to submit 
their current state. Then the service takes care of collecting all the states and 
computing the starting state for the new configuration and delivering it to the
members. When all members have been notified of the starting state for a config- 
uration c, all information about the membership set and the quorums of previous 
configurations is not needed anymore, and the service no longer needs to ensure 
intersection in membership between configurations before c and any subsequent 
ones that are formed. This is the basis of a garbage-collection mechanism which 
was introduced in [47]. 

The DC specification offers a broadcast/convergecast communication service
which works as follows: a process p submits a message to the service; the
service forwards this message to the members of the current configuration and,
upon receiving acknowledgment values from a quorum of members, it computes
a response for the message sent by process p and gives the response to p. This
communication mechanism was introduced in [30], though in the setting of
that paper there is no group-oriented computation.

We demonstrate the value of our DC specification by showing both how it 
can be implemented and how it can be used in an application. Both pieces are 
shown formally, with assertional proofs.

We implement DC by using a variant of the group membership algorithm
of [47]. Our variant integrates communication with the membership service,
provides state-exchange support at the beginning of a new configuration, and uses a
static configuration-oriented service internally. We prove that this algorithm
implements DC, in the sense of trace inclusion. The proof uses a simulation relation
and invariant assertions.

We develop an atomic read/write shared register on top of DC. The algorithm
is based on the work of Attiya, Bar-Noy and Dolev [2] and follows the approach 
used in [19,30]. The application exploits the communication and state-exchange 
services provided by DC. The proof of correctness uses a simulation relation and 
invariant assertions. 



2 Mathematical Foundations and Notation 

We describe the services and algorithms using the I/O automaton model of
Lynch and Tuttle [31] (without fairness). The model and associated methodology
are described in Chapter 8 of [29].

Next we provide some definitions used in the rest of the paper. 

We write λ for the empty sequence. If a is a sequence, then |a| denotes the
length of a. If a is a sequence and 1 ≤ i ≤ |a| then a(i) denotes the ith element
of a. Given a set S, seqof(S) denotes the set consisting of all finite sequences
of elements of S. If s and t are sequences, the concatenation of s and t, with s
coming first, is denoted by s + t. We say that sequence s is a prefix of sequence t,
written as s ≤ t, provided that there exists u such that s + u = t. The “head” of
a sequence a is a(1). A sequence can be used as a queue: the append operation
modifies the sequence by concatenating the sequence with a new element and the
remove operation modifies the sequence by deleting the head of the sequence.
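
A minimal sketch of this sequence notation, assuming sequences are modelled
as Python tuples (so append and remove return a new tuple instead of modifying
one in place); the function names are chosen here for illustration only.

```python
def is_prefix(s, t):
    """s <= t iff there exists u with s + u == t."""
    return len(s) <= len(t) and t[:len(s)] == s


def head(a):
    """The head a(1) of a nonempty sequence (1-based in the text)."""
    return a[0]


def append(queue, element):
    """Concatenate the sequence with a new element."""
    return queue + (element,)


def remove(queue):
    """Delete the head of the sequence."""
    return queue[1:]


empty = ()  # the empty sequence
q = append(append(empty, "x"), "y")
```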

If R is a binary relation, then we define dom(R), the domain of R, to be
the set (without repetitions) of first elements of the ordered pairs comprising
relation R.

We denote by 𝒫 the universe of all processors¹ and we assume that 𝒫 is totally
ordered. We denote by ℳ the universe of all possible messages. We denote by 𝒢
a totally ordered set of identifiers used to distinguish configurations. Given a
set S, the notation S_⊥ refers to the set S ∪ {⊥}. If a set S is totally ordered, we
extend the ordering of S to the set S_⊥ by letting ⊥ < s for any s ∈ S.

A configuration is a quadruple, c = (g, P, ℛ, 𝒲), where g ∈ 𝒢 is a unique
identifier, P ⊆ 𝒫 is a nonempty set of processors, and ℛ and 𝒲 are nonempty
sets of nonempty subsets² of P, such that R ∩ W ≠ {} for all R ∈ ℛ, W ∈ 𝒲.
Each element of ℛ is called a read quorum of c, and each element of 𝒲 a write
quorum. We let C denote the set of all configurations.

Given a configuration c = (g, P, ℛ, 𝒲), the notation c.id refers to the
configuration identifier g, the notation c.set refers to the membership set P, while
c.rqrms and c.wqrms refer to ℛ and 𝒲, respectively. We distinguish an initial
configuration c_0 = (g_0, P_0, ℛ_0, 𝒲_0), where g_0 is a distinguished configuration
identifier.
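
The definition of a configuration can be sketched as a small validated record.
The class below is an illustrative assumption (integer identifiers, field names
mirroring c.id, c.set, c.rqrms and c.wqrms); the helper threshold_configuration
builds quorums from two integer thresholds n_r and n_w with n_r + n_w > |P|,
which is one common special case rather than the general definition.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Configuration:
    cid: int              # c.id
    members: frozenset    # c.set
    rqrms: frozenset      # c.rqrms, a set of read quorums
    wqrms: frozenset      # c.wqrms, a set of write quorums

    def __post_init__(self):
        if not self.members or not self.rqrms or not self.wqrms:
            raise ValueError("members and quorum sets must be nonempty")
        for q in self.rqrms | self.wqrms:
            if not q or not q <= self.members:
                raise ValueError("each quorum must be a nonempty subset")
        if any(not (r & w) for r in self.rqrms for w in self.wqrms):
            raise ValueError("read and write quorums must intersect")


def threshold_configuration(cid, members, n_r, n_w):
    """Quorums are all subsets with at least n_r (read) or n_w (write) members."""
    members = frozenset(members)
    if n_r + n_w <= len(members):
        raise ValueError("need n_r + n_w > |P|")

    def at_least(k):
        return frozenset(frozenset(c)
                         for size in range(k, len(members) + 1)
                         for c in combinations(sorted(members), size))

    return Configuration(cid, members, at_least(n_r), at_least(n_w))


c0 = threshold_configuration(0, {1, 2, 3}, 2, 2)
```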

3 The DC Specification 

In many applications significant computation is performed only in special config- 
urations called primary configurations, which satisfy certain intersection prop- 
erties with previous primary configurations. In particular, we require that the 
membership set of a new primary configuration must include the members of at 
least one read quorum and one write quorum of the previous primary configu- 
ration. The DC specification provides to the client only configurations satisfying 
this property. This is similar to what the DVS service of [14] does for ordinary 
views. 

An important feature of the DC specification is that it allows for state-
exchange at the beginning of a new primary configuration. State-exchange at
the beginning of a new configuration is required by most applications. When
a new configuration is issued, each member of the configuration is supposed to
submit its current state to the service which, once it has obtained the state from
all the members of the configuration, computes the most up-to-date state over all
the members, called the starting state, and delivers this state to each member. This
way, each member begins regular computation in the new configuration knowing
the starting state. We remark that this is different from the approach used by
the DVS service of [14], which lets the members of the configuration compute
the starting state. Some existing group communication services also integrate
state-exchange within the service [43].

¹ In the rest of the paper we will use processor as a synonym for process. The
differences between the two terms are immaterial in our setting.

² Expressing each quorum as a set of subsets is a generalization of the common
technique where the quorums are based on integers n_r and n_w such that
n_r + n_w > |P|; the two approaches are related by defining the set of read quorums
as consisting of those subsets of P with cardinality at least n_r, and the set of
write quorums as consisting of those subsets of P of cardinality at least n_w.

Finally, the DC specification offers a broadcast/convergecast communication
mechanism. This mechanism involves all the members of a quorum, and uses a 
condenser function to process the information gathered from the quorum. More 
specifically, a client that wants to send a message (request) to the members of its 
current configuration submits the message together with a condenser function 
to the service; then the DC service broadcasts the message to all the members 
of the configuration and waits for a response from a quorum (the type of the 
quorum, read or write, is also specified by the client); once answers are received 
from a quorum, the DC service applies the condenser function to these answers 
in order to compute a response to give back to the client that sent the message. 
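
The mechanism just described can be sketched as a single synchronous function.
This is only an illustration under simplifying assumptions: the names
broadcast_convergecast, handler and reachable are hypothetical, acknowledgments
are computed immediately by a callback, and None stands for "no response yet".

```python
def broadcast_convergecast(message, members, quorums, handler, condenser, reachable):
    """Broadcast message, gather acknowledgments, and condense them once
    every member of some quorum has answered."""
    # Broadcast: each reachable member handles the message and acknowledges.
    acks = {p: handler(p, message) for p in members if p in reachable}
    # Convergecast: respond as soon as a whole quorum has acknowledged.
    for quorum in quorums:
        if all(p in acks for p in quorum):
            return condenser({p: acks[p] for p in quorum})
    return None


# Hypothetical read of a replicated register: each replica stores (tag, value)
# and the condenser keeps the pair with the largest tag.
replicas = {1: (2, "b"), 2: (1, "a"), 3: (2, "b")}
read_quorums = [{1, 2}, {1, 3}, {2, 3}]


def on_read(p, message):
    return replicas[p]


def latest(acks):
    return max(acks.values())


result = broadcast_convergecast("read", replicas.keys(), read_quorums,
                                on_read, latest, reachable={1, 2})
```

Passing the condenser together with the request lets the service compute the
response itself, so the client sees one answer rather than individual
acknowledgments.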

We remark that this kind of communication is different from those of the
VS service [21] and the DVS service [14]. Instead, it is like the one used in [30]. We
integrate it into DC because we want to develop a particular application that
benefits from this particular communication service (a read/write register as is
done in [30]).

Prior to providing the code for the DC specification, we need some notation 
and definitions, which we introduce in the following while giving an informal 
description of the code. 

Each operation requested by the client of the service is tagged with a unique
identifier. Let OID be the set of operation identifiers, partitioned into sets OID_p,
p ∈ 𝒫. Let A be a set of “acknowledgment” values and let ℛ be a set of
“response” values. A value condenser function is a function from (A_⊥)^𝒫 to ℛ.
Let Φ be the set of all value condenser functions. Let S be the set of states of
the client (this does not need to be the entire client’s state, but it may contain
only the relevant information in order for the application to work). A state
condenser function is a function from (S_⊥)^𝒫 to S. Let Ψ be the set of all state
condenser functions. Given a function f : 𝒫 → D from the set of processors 𝒫 to
some domain value D and given a subset P ⊆ 𝒫 of processors we write f|P to denote
the function f' defined as follows: f'(p) = f(p) if p ∈ P and f'(p) = ⊥ otherwise.
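
The restriction f|P can be illustrated with a dictionary standing in for the
function f. The names restrict and BOTTOM are assumptions of this sketch, with
None playing the role of ⊥.

```python
BOTTOM = None  # stands for the undefined value ⊥


def restrict(f, subset, universe):
    """Return f|subset over the whole universe: f(p) inside subset, ⊥ outside."""
    return {p: (f.get(p, BOTTOM) if p in subset else BOTTOM) for p in universe}


acks = {"p1": 10, "p2": 20, "p3": 30}
restricted = restrict(acks, {"p1", "p3"}, acks.keys())
```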

We use the following data type to describe an operation: 𝒟 = ℳ × Φ ×
{“read”, “write”} × 2^𝒫 × (A_⊥)^𝒫 × Bool, and we let 𝒪 = OID → 𝒟_⊥. Given an
operation descriptor, selectors for the components are msg, cnd, sel, dlv, ack,
and rsp. Given an operation descriptor d ∈ 𝒟 for an operation i, d.msg is the
message of operation i which is delivered to all the processes (it represents the
request of the operation, like read a register or write a register), d.cnd is the
condenser of operation i which is used to compute a response when acknowledgment
values are available from a quorum of processes, d.sel is a selector that specifies
whether to use a read or a write quorum, d.dlv is the set of processes to which
the message has been delivered, d.ack contains the acknowledgment values, and,
finally, d.rsp is a flag indicating whether or not the client has received a
response for the operation. Operation descriptors maintain information about the
operations. When an operation i is submitted, its descriptor d = pending[g](i) is
initialized to d = (m, φ, b, {}, {}, false) where m, φ and b come with the
operation submission (i.e., are provided by the client). Then d.dlv, d.ack and d.rsp are
updated while the operation is being serviced. Once a response has been given
back to the client and thus d.rsp is set to true, the operation is completed.
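
The life cycle of an operation descriptor can be sketched with a small record
type. The class OpDescriptor is a hypothetical rendering of the six components;
the example assumes, without checking it, that {p1, p2} is a quorum of the
selected kind, and uses max as a stand-in value condenser.

```python
from dataclasses import dataclass, field


@dataclass
class OpDescriptor:
    msg: object                               # request delivered to all processes
    cnd: object                               # value condenser function
    sel: str                                  # "read" or "write"
    dlv: set = field(default_factory=set)     # processes the message reached
    ack: dict = field(default_factory=dict)   # a missing key stands for ⊥
    rsp: bool = False                         # has the client been answered?


# Submission initializes the descriptor to (m, cnd, b, {}, {}, false).
d = OpDescriptor(msg="read-request", cnd=max, sel="read")

# While the operation is serviced, dlv and ack are filled in.
d.dlv.add("p1")
d.ack["p1"] = 7
d.dlv.add("p2")
d.ack["p2"] = 9

# Once a quorum has acknowledged, the condenser yields the response
# and the operation is completed.
response = d.cnd(d.ack.values())
d.rsp = True
```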

For each process p we define the current configuration of p as the last
configuration c given to p with a newconf(c)_p event (or a predefined configuration
if no such event has happened yet). The identifier of the current configuration
of process p is stored in variable cur-cid_p. When a configuration c has been
notified to a processor p we say that processor p has “attempted” configuration c.
We use the history variable attempted to record the set of processors that
have attempted a particular configuration c. More formally, p ∈ attempted[c.id]
iff processor p has attempted c.

Next we define an important notion, that of a “dead” configuration.
Informally, a dead configuration c is a configuration for which a member
process p went on to newer configurations, that is, it executed action newconf(c')_p
with c'.id > c.id, before receiving the notification, that is the newconf(c)_p
event, for configuration c (which can no longer be notified to that processor,
and thus is dead because processor p cannot participate and it is impossible
to compute the starting state). More formally, we define dead ∈ 2^C as
dead = {c ∈ C | ∃p ∈ c.set : cur-cid_p > c.id and p ∉ attempted[c.id]}.
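
The definition of dead can be read as a direct computation over the state. In
the sketch below the dictionaries configs, cur_cid and attempted are assumed
encodings of c.set, cur-cid and attempted, with integer identifiers and None
standing for ⊥.

```python
def dead_configurations(configs, cur_cid, attempted):
    """configs: id -> membership set; cur_cid: processor -> id or None;
    attempted: id -> set of processors that attempted that configuration."""
    dead = set()
    for cid, members in configs.items():
        for p in members:
            current = cur_cid.get(p)
            moved_on = current is not None and current > cid
            if moved_on and p not in attempted.get(cid, set()):
                dead.add(cid)
                break
    return dead


# Hypothetical state: b went from configuration 1 directly to 3,
# so configuration 2 can never be attempted by b.
configs = {1: {"a", "b"}, 2: {"a", "b", "c"}, 3: {"b", "c"}}
cur_cid = {"a": 2, "b": 3, "c": 3}
attempted = {1: {"a", "b"}, 2: {"a", "c"}, 3: {"b", "c"}}
dead = dead_configurations(configs, cur_cid, attempted)
```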



DC (Signature and state)

Signature:

  Input:
    submit(m, φ, b, i)_p, m ∈ ℳ, φ ∈ Φ, b ∈ {“read”, “write”}, p ∈ 𝒫, i ∈ OID_p
    ackdlvr(a, i)_p, a ∈ A, i ∈ OID, p ∈ 𝒫
    submit-state(s, ψ)_p, s ∈ S, ψ ∈ Ψ

  Internal:
    createconf(c), c ∈ C

  Output:
    newconf(c)_p, c ∈ C, p ∈ c.set
    newstate(s)_p, s ∈ S
    respond(a, i)_p, a ∈ A, i ∈ OID_p, p ∈ 𝒫
    deliver(m, i)_p, m ∈ ℳ, i ∈ OID, p ∈ 𝒫

State:

  created ∈ 2^C, init {c_0}
  for each p ∈ 𝒫:
    cur-cid[p] ∈ 𝒢_⊥, init g_0 if p ∈ P_0, ⊥ else
  for each g ∈ 𝒢:
    attempted[g] ∈ 2^𝒫, init P_0 if g = g_0, {} else
  for each g ∈ 𝒢:
    got-state[g] : 𝒫 → S_⊥, init everywhere ⊥
    condenser[g] : 𝒫 → Ψ_⊥, init everywhere ⊥
    state-dlv[g] ∈ 2^𝒫, init P_0 if g = g_0, {} else
    pending[g] ∈ 𝒪, init everywhere ⊥

Fig. 1. The DC signature and state



We say that a configuration c is totally attempted in a state s of DC if c.set ⊆
attempted[c.id]. We denote by TotAtt the set of totally attempted configurations.
Informally, a totally attempted configuration is a configuration for which all
members have received notification of the new configuration. Similarly, we say
that a configuration c is attempted in a state s of DC if attempted[c.id] ≠ {}. We
denote by Att the set of attempted configurations. Clearly TotAtt ⊆ Att.
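
These two sets can be computed directly from the attempted variable. The
dictionary encoding (identifier to membership set, identifier to set of
processors) is an assumption of this sketch.

```python
def attempted_sets(configs, attempted):
    """Return (Att, TotAtt) as sets of configuration identifiers."""
    att = {cid for cid in configs if attempted.get(cid)}
    tot_att = {cid for cid, members in configs.items()
               if members <= attempted.get(cid, set())}
    return att, tot_att


# Hypothetical state: configuration 2 was attempted by a and c but not by b.
configs = {1: {"a", "b"}, 2: {"a", "b", "c"}, 3: {"b", "c"}}
attempted = {1: {"a", "b"}, 2: {"a", "c"}, 3: {"b", "c"}}
att, tot_att = attempted_sets(configs, attempted)
```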

DC (Transitions)

Actions:

internal createconf(c)
  Pre: For all w ∈ created: c.id ≠ w.id
       if c ∉ dead then
         For all w ∈ created, w.id < c.id:
           w ∈ dead or
           (∃x ∈ TotEst: w.id < x.id < c.id) ∨
           (∃R ∈ w.rqrms, ∃W ∈ w.wqrms: R ∪ W ⊆ c.set)
         For all w ∈ created, w.id > c.id:
           w ∈ dead or
           (∃x ∈ TotEst: c.id < x.id < w.id) ∨
           (∃R ∈ c.rqrms, ∃W ∈ c.wqrms: R ∪ W ⊆ w.set)
  Eff: created := created ∪ {c}

output newconf(c)_p, p ∈ c.set
  Pre: c ∈ created
       c.id > cur-cid[p]
  Eff: cur-cid[p] := c.id
       attempted[c.id] := attempted[c.id] ∪ {p}

input submit-state(s, ψ)_p
  Eff: if cur-cid[p] ≠ ⊥ and got-state[cur-cid[p]](p) = ⊥ then
         got-state[cur-cid[p]](p) := s
         condenser[cur-cid[p]](p) := ψ

output newstate(s)_p choose c
  Pre: c.id = cur-cid[p]
       c ∈ created
       ∀q ∈ c.set: got-state[c.id](q) ≠ ⊥
       s = condenser[c.id](p)(got-state[c.id])
       p ∉ state-dlv[c.id]
  Eff: state-dlv[c.id] := state-dlv[c.id] ∪ {p}

input submit(m, φ, b, i)_p
  Eff: if cur-cid[p] ≠ ⊥ then
         pending[cur-cid[p]](i) := (m, φ, b, {}, {}, false)

output deliver(m, i)_p choose g
  Pre: g = cur-cid[p]
       p ∉ pending[g](i).dlv
       pending[g](i).msg = m
  Eff: pending[g](i).dlv := pending[g](i).dlv ∪ {p}

input ackdlvr(a, i)_p
  Eff: if cur-cid[p] ≠ ⊥ and pending[cur-cid[p]](i).ack(p) = ⊥ then
         pending[cur-cid[p]](i).ack(p) := a

output respond(r, i)_p choose c, Q
  Pre: c.id = cur-cid[p]
       c ∈ created
       i ∈ OID_p
       pending[c.id](i).rsp = false
       if pending[c.id](i).sel = “read” then Q ∈ c.rqrms
       if pending[c.id](i).sel = “write” then Q ∈ c.wqrms
       let f = pending[c.id](i).ack
       ∀q ∈ Q: f(q) ≠ ⊥
       r = pending[c.id](i).cnd(f|Q)
  Eff: pending[c.id](i).rsp := true

Fig. 2. The DC transitions



After a processor p has attempted a new configuration, it submits its state by
means of action submit-state(s, ψ)_p. Variable got-state[g](p) records the state s
submitted by processor p for the current configuration of p, whose identifier
is g. Similarly, the state condenser function submitted by p is recorded in
variable condenser[g](p). After all member processors of a configuration c have
submitted their state, the starting state for c can be computed, by using the






Roberto De Prisco et al.



appropriate condenser function, and can be given to the members of c. Note
that the state condenser is used when all members have submitted a state,
in contrast to message convergecast, which applies the value condenser once a
quorum of values is known. Variable state-dlv[g] records the set of processors to
which the starting state for the configuration with identifier g has been delivered.
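
The state-exchange bookkeeping for one configuration can be sketched as a small
class. The class name StateExchange, the use of None for "not yet available",
and the example condenser (keep the state with the largest tag) are assumptions
of this illustration.

```python
class StateExchange:
    """State exchange for a single configuration with the given members."""

    def __init__(self, members):
        self.members = set(members)
        self.got_state = {}     # processor -> submitted state
        self.condenser = {}     # processor -> state condenser function
        self.state_dlv = set()  # processors that received the starting state

    def submit_state(self, p, state, condenser):
        # Only the first submission of each member is recorded.
        if p in self.members and p not in self.got_state:
            self.got_state[p] = state
            self.condenser[p] = condenser

    def new_state(self, p):
        """Starting state for p, or None until all members have submitted
        (or if p has already received it)."""
        if p in self.state_dlv or set(self.got_state) != self.members:
            return None
        self.state_dlv.add(p)
        return self.condenser[p](dict(self.got_state))

    def totally_established(self):
        return self.members <= self.state_dlv


def latest(states):
    """Hypothetical state condenser: states are (tag, value) pairs."""
    return max(states.values())


ex = StateExchange({"a", "b"})
ex.submit_state("a", (1, "x"), latest)
too_early = ex.new_state("a")          # b has not submitted yet
ex.submit_state("b", (2, "y"), latest)
start_a = ex.new_state("a")
start_b = ex.new_state("b")
```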

When the starting state for a configuration c has been delivered to
processor p we say that c is established (at p). A configuration is totally established
when it is established at all member processors of the configuration. More
formally, a configuration c is totally established in a state s of DC if, in state s, we
have c.set ⊆ state-dlv[c.id]. We denote by TotEst the set of totally established
configurations. When a configuration c becomes totally established, information
about the membership set and quorums of configurations previous to c can be
discarded, because the intersection property will be guaranteed between c and
later configurations.

The code of the DC specification is given in Figures 1 and 2. 

The second precondition of createconf(c) is the key to our specification. It 
states that when a configuration c is created it must either be already dead or for 
any other configuration w such that there are no intervening totally established 
configurations, the earlier configuration (i.e., the one with smaller identifier) has 
one read quorum and one write quorum whose members are included in the 
membership set of the later configuration (i.e., the one with bigger identifier). 
The above precondition is formalized in the following key invariant: 

Invariant 1 Let c_1, c_2 ∈ created \ dead, with c_1.id < c_2.id. Then either there
exists w ∈ TotEst with c_1.id < w.id < c_2.id, or else there exist quorums
R ∈ c_1.rqrms and W ∈ c_1.wqrms such that R ∪ W ⊆ c_2.set.

The property stated by this invariant is used to prove correct the application 
that we build on top of DC. We remark that dead configurations are excluded, 
that is, the intersection property may not hold for dead configurations. However, 
in a dead configuration it is not possible to make progress because for such a 
configuration there is at least one process that will not participate and thus the 
configuration will never become established. 
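
As an illustration, Invariant 1 can be checked over a finite snapshot of the
state. The encoding below (identifier to a triple of membership set, read
quorums and write quorums, plus sets of dead and totally established
identifiers) is an assumption of this sketch.

```python
def invariant_holds(created, dead, tot_est):
    """created: id -> (members, read_quorums, write_quorums);
    dead, tot_est: sets of configuration identifiers."""
    live = sorted(cid for cid in created if cid not in dead)
    for index, c1 in enumerate(live):
        for c2 in live[index + 1:]:
            # An intervening totally established configuration exempts the pair.
            if any(c1 < w < c2 for w in tot_est):
                continue
            _, reads, writes = created[c1]
            later_members = created[c2][0]
            if not (any(r <= later_members for r in reads)
                    and any(w <= later_members for w in writes)):
                return False
    return True


# Hypothetical configurations using majorities (size 2 of 3) for both quorum kinds.
def majorities(a, b, c):
    return [{a, b}, {a, c}, {b, c}]


created = {
    1: ({1, 2, 3}, majorities(1, 2, 3), majorities(1, 2, 3)),
    2: ({2, 3, 4}, majorities(2, 3, 4), majorities(2, 3, 4)),
    3: ({3, 4, 5}, majorities(3, 4, 5), majorities(3, 4, 5)),
}
```

Here configurations 1 and 3 share only processor 3, so the invariant fails
unless configuration 2 is totally established or configuration 3 is dead.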

The need for considering dead configurations comes from the implementation
of the specification that we provide. It is possible to give a stronger version of
DC by requiring that the intersection property in the precondition of action
createconf hold also for dead configurations; however, this stronger version
might not be implementable. Moreover, as we have said above, there is no loss
of generality since no progress is made anyway in dead configurations.

4 An Implementation of DC 

The DC specification can be implemented, in the sense of trace inclusion, with 
an algorithm similar to that used in [14] to implement the DVS service. Hence it 
uses ideas from [47]. This implementation consists of an automaton DC-CODE_p
for each p ∈ 𝒫. Due to space constraints we omit the code and the proof of
correctness and provide only an overall description. 

4.1 The Implementation

The automaton DC-CODE_p uses special messages, tagged either with “info”, used
to send information about the active and ambiguous configurations, or with
“got-state”, used to send the state submitted by a process to all the members
of the configuration. The former information is needed to check the intersection
property that new primary configurations have to satisfy according to the DC
specification. The latter information is needed in order to compute the starting
state for a new configuration.

The major problem is that the DC specification requires a global intersection 
property (i.e., a property that can be checked only by someone that knows 
the entire system state), while each single process has a local knowledge of 
the system. So, in order to guarantee that a new configuration satisfies the 
requirement of DC, each single process needs information from the other
member processes of the configuration.

Informally, the filtering of configurations works as follows. Each process keeps
track of the latest totally established configuration, called the “active”
configuration, recorded in variable act, and a set of “ambiguous” configurations,
recorded in variable amb, which are those configurations that were notified
after the active configuration but have not yet become established. We define
use = act ∪ amb. When a new configuration c is detected, process p sends out
an “info” message containing its current act_p and amb_p values to all other
processors in the new configuration, using an underlying broadcast communication
mechanism, and waits to receive the corresponding “info” messages for
configuration c from all the other members of c. After receiving this information (and
updating its own act_p and amb_p accordingly), process p checks whether c has
the required intersection property with each configuration in the use_p set. If so,
configuration c is given in output to the client at p by means of action newconf(c)_p.
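
The local filtering step can be sketched as follows. Since the code of
DC-CODE_p is omitted here, the merging rule below (adopt the most recent active
configuration reported, keep only ambiguous configurations newer than it, and
record an accepted configuration as ambiguous) is an assumption modelled on the
informal description, and all names are hypothetical.

```python
from itertools import combinations


def majority_config(cid, members):
    """Hypothetical configuration whose read and write quorums are majorities."""
    members = frozenset(members)
    size = len(members) // 2 + 1
    quorums = [frozenset(c) for c in combinations(sorted(members), size)]
    return {"id": cid, "set": members, "rqrms": quorums, "wqrms": quorums}


def includes_rw_quorum(old, new_members):
    return (any(r <= new_members for r in old["rqrms"])
            and any(w <= new_members for w in old["wqrms"]))


def filter_configuration(new, act, amb, infos):
    """infos: list of (act, amb) pairs received from the other members.
    Returns (accepted, act, amb) after merging and checking."""
    for other_act, other_amb in infos:
        if other_act["id"] > act["id"]:
            act = other_act
        amb = {**amb, **other_amb}
    amb = {cid: c for cid, c in amb.items() if cid > act["id"]}
    use = [act] + list(amb.values())
    accepted = all(includes_rw_quorum(c, new["set"]) for c in use)
    if accepted:
        amb[new["id"]] = new  # notified, but not yet established
    return accepted, act, amb


c1 = majority_config(1, {1, 2, 3})
c2 = majority_config(2, {2, 3, 4})
c3 = majority_config(3, {3, 4, 5})

# Another member reports c2 as ambiguous: c3 must then also relate to c1.
rejected, _, amb_after = filter_configuration(c3, c1, {}, [(c1, {2: c2})])
# Another member reports c2 as active: only c2 has to be checked.
accepted, act_after, _ = filter_configuration(c3, c1, {}, [(c2, {})])
```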

When a new primary configuration c has been given in output to processor p
by means of action newconf(c)_p, the client at p submits its current state together
with a condenser function to be used to compute the starting state when all
other members have submitted their state (such a condenser function depends
on the application). Clearly the state of p is needed by other processors in the
configuration while p needs the state of the other processors. Hence when a
submit-state(s, ψ)_p is executed at p, the state s submitted by processor p is sent
out with a “got-state” message to all other members of the configuration. Upon
receiving the state of all other processors, DC-CODE_p uses the state condenser
function ψ provided by the client at p in order to compute the starting state to
be output, by means of action newstate(s)_p, to the client at p.

Finally, the broadcast/convergecast communication mechanism of DC is simulated
by using the underlying broadcast communication mechanism (this simulation
is quite straightforward).

4.2 Proof 

The proof that DC-IMPL implements DC in the sense of trace inclusion is done
by using invariants and a simulation relation. The proof is similar to the one
in [14] used to prove that DVS-IMPL implements DVS. There is a key difference
in the implementation, which provides new insights for the DVS specification and
implementation, as we explain below.

The DVS specification requires a global intersection property which is the following:
given two primary views w and v with no intervening totally established
view, we must have that $w.set \cap v.set \neq \{\}$. The DVS implementation, when
delivering a new view v, checks a stronger property locally to the processors,
which requires that $|v.set \cap w.set| > |w.set|/2$ for all the views w, $w.id < v.id$,
known by the processor performing the check.

The DC specification requires a global intersection property which is the
following: given two primary configurations, neither of which is dead, with no
intervening totally established configuration, there must exist a read quorum
and a write quorum of the configuration with the smaller identifier that are
both included in the membership set of the configuration with the bigger identifier.
The DC implementation checks the same property locally to each processor. The
intuitive reason why checking the same property locally lets us prove it
globally is that we exclude dead configurations. This suggests that for DVS too we
can either prove the stronger intersection property (the one checked locally) or
use a weaker local check (the intersection required globally) if we exclude
dead views.
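
The local check performed by the DC implementation can be illustrated with a short Python sketch. It tests whether a candidate configuration contains a read quorum and a write quorum of every configuration in a process's use set. The `Configuration` class, its field names, and the function names are assumptions made for this illustration only; they are not taken from the DC-IMPL code, which is not given here.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class Configuration:
    ident: int                                  # configuration identifier
    members: FrozenSet[str]                     # membership set
    read_quorums: Tuple[FrozenSet[str], ...]    # read quorums of this configuration
    write_quorums: Tuple[FrozenSet[str], ...]   # write quorums of this configuration


def contains_quorum(members: FrozenSet[str],
                    quorums: Iterable[FrozenSet[str]]) -> bool:
    """True if at least one of the quorums is a subset of `members`."""
    return any(q <= members for q in quorums)


def intersects(old: Configuration, new: Configuration) -> bool:
    """True if `new` includes both a read quorum and a write quorum of `old`."""
    return (contains_quorum(new.members, old.read_quorums)
            and contains_quorum(new.members, old.write_quorums))


def passes_local_check(new: Configuration,
                       use: Iterable[Configuration]) -> bool:
    """Local check: `new` must intersect every configuration in the use set."""
    return all(intersects(old, new) for old in use)
```

With majority quorums over {a, b, c}, a new configuration with members {a, b, d} passes the check (it contains the quorum {a, b}), whereas one with members {a, d} does not.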

5 Atomic Read/Write Shared Memory Algorithm

In this section we show how to use DC to implement an atomic multi-writer
multi-reader shared register. The algorithm is an extension of the single-writer
multi-reader atomic register of Attiya, Bar-Noy and Dolev [2]. A similar extension
was provided in [30]. The overall algorithm is called ABD-SYS and consists of an
automaton ABD-CODE$_p$ for each processor p, and DC. Due to space constraints the code
of automaton ABD-CODE$_p$ is omitted from this extended abstract.

5.1 The Algorithm 

Each processor keeps a copy of the shared register in variable val, paired with a
tag in variable tag. Tags are used to establish the time when values are written: a
value paired with a bigger tag has been written after a value paired with a smaller
tag. Tags consist of pairs (j,p) where j is a sequence number (an integer) and p
is a processor identifier. Tags are ordered according to their sequence numbers,
with processor identifiers breaking ties. Given a tag t = (j,p), the notation t.seq
denotes the sequence number j.
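
The tag ordering just described is a lexicographic order on pairs, which the following minimal Python sketch makes concrete. The class name `Tag` and the use of strings as processor identifiers are assumptions for illustration; any totally ordered identifier type would serve.

```python
from typing import NamedTuple


class Tag(NamedTuple):
    seq: int   # sequence number j
    pid: str   # processor identifier p (assumed to be a string here)


# Named tuples compare lexicographically, so sequence numbers are compared
# first and processor identifiers break ties.
tags = [Tag(1, "q"), Tag(2, "a"), Tag(1, "p")]
latest = max(tags)
```

Here `latest.seq` plays the role of t.seq for the maximum tag t.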

The algorithm has two modes of operation: a normal mode and a reconfiguration
mode. The latter is used to establish a new configuration. It is entered
when a new configuration is announced (action newconf) and is left when the 
configuration becomes established (action newstate). The former is the mode 
where read and write operations are performed and it is entered when a config- 
uration is established and is left when a new configuration is announced. During 
the reconfiguration mode pending operations are delayed until the normal mode 
is restored. 

Clients of the service can request read and write operations by means of
actions read$_p$ and write(x)$_p$. We assume that a client does not invoke a
new operation request before receiving the response to the previous request.
Both types of requests (read and write) are handled in a similar way: there is
a query phase and a subsequent propagate phase. During the query phase the
server receiving the request “queries” a read quorum in order to get the value
of the shared register and the corresponding tag from each member of the
read quorum. From these it selects the value x corresponding to the maximum tag t.
This concludes the query phase. In the propagate phase the server sends a
new value and a new tag (which are (t, x) for a read$_p$ operation and
((t.seq + 1, p), y) for a write(y)$_p$ operation) to the members of a write quorum.
These processors update their own copy of the register if the tag received is
greater than their current tag; then they send back an acknowledgment to the
server p. When p gets the acknowledgment messages from the members of a write
quorum, the propagate phase is completed. At this point the server can respond
to the client that issued the operation with either the value read, in the case of
a read operation, or with just a confirmation, in the case of a write operation.
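
The two phases can be sketched in Python as a purely sequential, in-memory simulation. This is an illustrative sketch under strong assumptions, not the omitted ABD-CODE$_p$ automaton: there is no message passing, no failure, no concurrency, and no reconfiguration; acknowledgments are implicit in the function returning; and the names `query`, `propagate`, `read`, and `write` as well as the dictionary layout of a replica are hypothetical.

```python
from typing import Any, Dict, Iterable, Tuple

Tag = Tuple[int, str]              # (sequence number, processor identifier)
Replicas = Dict[str, Dict[str, Any]]   # processor id -> {"tag": Tag, "val": value}


def query(replicas: Replicas, read_quorum: Iterable[str]) -> Tuple[Tag, Any]:
    """Query phase: return the (tag, value) pair with the maximum tag in the quorum."""
    return max(((replicas[r]["tag"], replicas[r]["val"]) for r in read_quorum),
               key=lambda pair: pair[0])


def propagate(replicas: Replicas, write_quorum: Iterable[str],
              tag: Tag, val: Any) -> None:
    """Propagate phase: each quorum member adopts (tag, val) if tag is newer."""
    for r in write_quorum:
        if tag > replicas[r]["tag"]:
            replicas[r]["tag"] = tag
            replicas[r]["val"] = val
    # Returning here stands in for receiving all acknowledgments.


def read(replicas: Replicas, read_quorum, write_quorum) -> Any:
    t, x = query(replicas, read_quorum)
    propagate(replicas, write_quorum, t, x)      # a read propagates (t, x)
    return x


def write(replicas: Replicas, p: str, y: Any, read_quorum, write_quorum) -> None:
    t, _ = query(replicas, read_quorum)
    propagate(replicas, write_quorum, (t[0] + 1, p), y)   # new tag (t.seq + 1, p)
```

Because every read quorum intersects every write quorum, a read that follows a completed write in this simulation sees the written tag in its query phase.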

We remark that when a configuration change happens during the execution 
of a requested operation, the completion of the operation is delayed until the 
normal mode is restored. However if the query phase has already been completed 
it is not necessary to repeat it in the new configuration. 

5.2 Proof 

The proof that ABD-SYS implements an atomic read/write shared register is
omitted from this extended abstract. The proof uses an approach similar to that 
used in [14] and in [21] to prove the correctness of applications built on top of 
DVS and VS, respectively. 

We remark that the intersection property of DC, namely that there exist
a read quorum R and a write quorum W of a previous primary configuration
both belonging to the next primary configuration, comes from this particular
application. For other applications one might have different (maybe weaker)
intersection properties. For example, one might require that the new primary
configuration contains a read quorum of the previous one (and not a write one).
In our case, we must require both a read quorum and a write quorum in the new
primary because we want to implement an atomic register: if, for example,
we only required a read quorum to be in the new configuration, it would be possible
for other read quorums of the old configuration to read old values,
making the register no longer atomic.

6 Conclusions 

In this paper we have combined the notion of dynamic primary views with that of 
quorum systems, to identify a service that provides configurations. Our key con- 
tribution in solving the problem of making quorums dynamic, that is, adaptable 
to the set of processors currently connected, is to identify a suitable intersection 
property which can be used to maintain consistency across different configura- 
tions. An interesting direction of research is to identify which properties have to 
be satisfied in order to transform a “static” service or application into a “dynamic”
one. For example, some data replication algorithms are based on views
with a distinguished leader (e.g., [28,40]) and these applications tolerate tran- 
sient failures, i.e., they work well in a static setting. We think that it is possible 
to follow an approach similar to the one used in this paper to transform these 
applications into ones that adapt better to dynamic settings, where processes 
can leave the system forever and new members can join the system. 

Abstract. We present a model of distributed systems intended for the
description of group membership services. The model incorporates a gen- 
eralization of failure detectors [9], which we call oracles. Oracles provide 
information about processes that may be included into or excluded from 
the group. Based on this model, we provide a specification of a group 
membership service in asynchronous systems augmented with oracles. 
We also present an algorithm that implements such a service provided 
that the information supplied by the oracles is of sufficient quality. 



1 Introduction 

Several researchers have argued that the design and implementation of complex 
distributed systems, especially those for which fault-tolerance and high avail- 
ability are primary goals, is facilitated by the use of “process groups” (for an 
overview see [16]). Speaking very informally, a process group consists of a col- 
lection of processes that cooperate in carrying out a common task — such as 
maintaining a database, supporting users who are collaborating in a complex 
design, or carrying out a massively parallel computation. In such applications, 
processes benefit from multicast capabilities, so that they can efficiently dissem- 
inate information about their own state to the entire group. 

For multicasts to be meaningful, processes must have a fairly well coordinated 
view of the group’s membership. One complicating factor is that the membership 
of the group changes over time. This is certainly due to failures and recoveries; it 
may also be due to other, application-specific, reasons: Subscribers to a database 
change over time, the participants in the complex design change as the design 
progresses through different phases, and nodes whose spare cycles are used for 
a massively parallel computation may drop in or out of the task as their load 
(and consequently the availability of spare cycles) changes. These requirements 
call for some system-level facilities that allow the (ever-changing) members of 
the group to have consistent views of the (ever-changing) membership. A second 
complicating factor is that often the maintenance of a consistent view of the 
group’s membership must be accomplished in asynchronous distributed systems; 
that is, systems where the speeds of (non-faulty) processes and the delays of
messages are finite but unbounded. 

Many systems have been developed that provide a group membership ser- 
vice. The first of these systems is the Isis toolkit [6]. Others include the Highly 
Available System [10], Amoeba [15], Transis [2], Horus [19], Relacs [5], Newtop [12],
and Totem [3]. Group membership services can be classified in one of
two categories: primary partition and partitionable. A primary-partition group 
membership service ensures that at each (logical) time there is only one view 
of the group, by restricting changes in the membership to only one partition.^ 
A partitionable group membership service allows multiple independent views to 
coexist at the same (logical) time in different partitions. In this paper, we focus 
exclusively on primary-partition services. 

Despite the existence of many systems that offer a group membership ser- 
vice, there is no satisfactory rigorous and abstract specification for that service in 
the context of asynchronous systems that encompasses both safety and liveness 
requirements. Cristian [11] gave such a specification for synchronous systems. 
Ricciardi and Birman [18] gave a specification for asynchronous systems but 
this specification contains serious flaws, as pointed out by Anceaume et al. [4].
In particular, the liveness property of the specification is too strong, and it is
not at all obvious how to weaken it so as to ensure that the resulting property
is both useful to applications using the service and implementable. De Prisco
et al. [17] gave a rigorous I/O automata-based specification for group membership
in asynchronous systems but, as the authors expressly acknowledge, their
specification encompasses only safety properties. 

Guerraoui and Schiper [14] propose a solution to Group Membership that has 
similarities to our approach. They advocate using Consensus to solve a variety of
agreement problems such as Group Membership — but also Nonblocking Atomic 
Commitment, View Synchronous Communication and Atomic Multicast — in 
the context of asynchronous systems with failure detectors. The scope of their 
paper, therefore, is wider than ours, which addresses only Group Membership. 
Their specification of Group Membership, however, is incomplete. 

The difficulty of formulating a useful, yet implementable, specification for
(primary-partition) group membership services in asynchronous systems is, perhaps,
not surprising if the matter is considered in the light of the well-known
impossibility result by Fischer, Lynch and Paterson [13]. This result states that
in asynchronous systems where (even) one process can crash, any algorithm that
satisfies the safety property that correct processes should never reach inconsistent
decisions, must violate the liveness property that each correct process should
eventually reach a decision. It is true that there are differences between the sort
of agreement that must be reached in the Consensus problem (to which this
impossibility result applies), and the sort of agreement that must be achieved by
(primary-partition) group membership services. Chandra et al., however, showed
that these differences do not render the Fischer, Lynch and Paterson result inapplicable:
(even) a very weak type of group membership service is susceptible
to a similar impossibility result [8].

In view of this, it is reasonable to try resolving the problem of formulat- 
ing a useful and implementable specification of group membership by bringing 

^ Roughly speaking, a partition is a maximal set of processes that can communicate 
with one another. 

to bear the approaches which allowed researchers to get around the Fischer-
Lynch-Paterson impossibility result for Consensus. The use of unreliable failure
detectors, pioneered by Chandra and Toueg [9], seems especially well-suited to 
the task: One of the reasons why the membership of a group changes is that 
processes fail. We can use information provided by (unreliable) failure detectors 
both to determine the membership of the group (this helps in formulating a 
useful specification) and to achieve agreement among the surviving members of 
the group (this helps in designing an implementation for the specification). 

Considerable modifications to the original model of failure detectors are re- 
quired, however. In part, these are due to the fact that the set of processes in 
the system is no longer fixed but changes over time. Furthermore, as we stated 
earlier, failures (and recoveries) are only one cause of changes in the member- 
ship of a group. We would like our model to be general enough that we can also 
describe other causes. A difficulty we must overcome here is that these causes 
are application-specific while we want our model to be application-independent.

In this paper, we introduce a model based on an abstraction we call oracles.
Each process has access to an include oracle and to an exclude oracle,
from which the process can obtain information about processes it should include 
into or exclude from the group. The information provided by oracles regarding 
which processes to include or exclude incorporates whatever (possibly imperfect) 
information the oracles have about failures in the system, but may also incor- 
porate application-specific information. For example, in a database application 
a process might be included because its owner subscribed to the database; in 
a massively-parallel computation a process might be excluded because it runs 
on a node that suddenly became heavily loaded and has no spare cycles. We do 
not model the reasons for inclusion or exclusion — merely the fact. This makes 
the model application-independent. Oracles, and the model of computation, are 
described in Section 2. 

Using this model, we present a rigorous and abstract specification for a group 
membership service in asynchronous systems. Our specification addresses both 
safety and liveness requirements. It is described in Section 3.2. 

Finally, we present an algorithm to implement this specification. The algo- 
rithm is based on Chandra and Toueg’s rotating coordinator algorithm [9], and 
is given in Section 4. Due to space limitations, the proof of correctness of the 
algorithm is omitted. 



2 The Model 



The system consists of an unbounded set of processes, $\Pi$, connected
by reliable links. The unbounded set of processes $\Pi$ represents the set of
processes that may participate in the computation. In the course of any particu- 
lar computation only some of these processes will actually participate. Informally 
speaking, we say that a process p is participating if p is executing steps of the
protocol and some process (possibly p) wishes p to join the computation^. 

To simplify the model, we assume the existence of a fictional global clock, 
whose output is the set of positive integers denoted by T. The processes do not 
have access to this clock. 

To model the set of processes that actually participate in a computation, and
the duration of their participation, we define a participation pattern. Formally,
a participation pattern is a function $PP : T \to 2^{\Pi}$ where intuitively $PP(t)$ denotes
the set of processes that are participating in the computation at time t.
We restrict the participation pattern so that there is at most one (possibly infinite)
interval of time during which a process is participating. More precisely,
$\forall p \in \Pi,\ \forall t, t', t'' \in T,\ t < t' < t'',\ (p \in PP(t) \wedge p \notin PP(t')) \Rightarrow p \notin PP(t'')$.

We define the set of persistently participating processes in a participation
pattern as $Persist(PP) = \{p : \exists t \in T,\ \forall t' > t,\ p \in PP(t')\}$.
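
To make these two definitions concrete, the following Python sketch works on a finite prefix of a participation pattern, represented as a list whose entry at index t − 1 is PP(t). The representation and the function names are assumptions for illustration. Since Persist(PP) is defined over an infinite time line, the sketch can only approximate it by assuming that the pattern stays constant after the last recorded time.

```python
from typing import List, Set

Pattern = List[Set[str]]   # pattern[t - 1] = PP(t) for t = 1 .. len(pattern)


def single_interval(pattern: Pattern) -> bool:
    """Check the restriction on PP: a process that stops participating
    never participates again within the recorded prefix."""
    seen: Set[str] = set()   # processes that participated at some earlier time
    left: Set[str] = set()   # processes that participated and then stopped
    for current in pattern:
        if current & left:
            return False     # some process re-entered after leaving
        left |= seen - current
        seen |= current
    return True


def persist_on_prefix(pattern: Pattern) -> Set[str]:
    """Approximate Persist(PP), assuming PP stays constant after the prefix:
    exactly the processes participating at the last recorded time."""
    return set(pattern[-1]) if pattern else set()
```

For the prefix [{p}, {p, q}, {q}, {q}] the restriction holds and the approximation of Persist(PP) is {q}; the prefix [{p}, {}, {p}] violates the restriction because p participates in two separate intervals.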

2.1 Oracles 

Every process has access to an exclude oracle and an include oracle. Each process 
may query its oracles at any time to obtain a set of processes that the oracle 
wishes to exclude from or include into the group. An exclude oracle may wish 
to exclude a process because it believes (as a failure detector might) that the 
process is no longer participating in the protocol. It may also wish to exclude a 
process for application-specific reasons. Similarly, an include oracle may wish to 
include a process because it believes that the process has recovered, or because 
of application-specific reasons. We do not assume that oracles are “perfect” ; they 
reflect possibly imperfect (and possibly contradictory) information available to 
the processes. Thus, the mere fact that a process p is in some process q’s exclude
(or include) oracle does not necessarily mean that p will be excluded from (or
included into) the group. However, as will become clearer when we present
the liveness properties of our Group Membership specification in Section 3, if
the information of oracles is persistent and non-contradictory, then the group’s 
membership must conform to the information of the oracles. 

Oracles — in particular exclude oracles — are intended as a generalization 
of failure detectors [9,7]. It is tempting to define oracles in a similar manner. 
There are, however, some differences that complicate somewhat the definition. 
In particular, the set of processes in the system is no longer fixed; it varies as 
processes join or leave the group. In fact, the set of potential processes in the 
system may be infinite. Consider one of the most interesting failure detectors, 
the so-called eventually strong failure detector. This is defined in terms of the 
following two properties: (1) all faulty processes must eventually be suspected 
permanently by all correct processes in the system (eventual strong complete- 
ness); and (2) some correct process must eventually cease being suspected by all 
correct processes in the system (eventual weak accuracy). 

^ Consider two processes, p and q. Process p is participating in the computation but 
experiences a failure and stops participating without executing another step. Pro- 
cess q is participating in the computation and executes all its steps and terminates. 
We model both p and q as no longer participating. 

In our new setting, “all correct processes in the system” is a huge (possibly
infinite) set. Furthermore, most processes in this set are irrelevant to the group: 
they will never become members of it. It makes no sense to require the exclude 
oracle to exclude the many processes that may fail but which are not part of 
the group. More importantly, the existence of some process in the system that 
is eventually not excluded by any correct one is small comfort: we want to be 
assured that such a process will exist in the group, not in the entire system 
at large. In other words, we would like to state the completeness and accuracy 
properties relative to the set of processes in “the group”. The trouble is that 
there is no such thing as “the group” yet: we are trying to define a formal model 
in which to describe “the group”!

We resolve this apparent circularity by defining the behavior of the exclude 
oracle in terms of a parameter which specifies a set of processes, and is intended 
to be “the current membership of the group”. The exclude oracle, formally speaking,
must be prepared to respond to a query where the set of processes specified
in this parameter could be any set of processes whatsoever. In fact, processes are 
restricted to querying the exclude oracle using the current membership of the 
group as the actual parameter in the course of an execution. (This will become 
clearer when we define “well-formed runs” in the discussion of computations.)
With these remarks in mind to help guide our intuition, we are now ready to 
formally define exclude and include oracles. 

An exclude oracle history is a function $XOH : \Pi \times T \times 2^{\Pi} \to 2^{\Pi}$. $XOH(p, t, g)$
is the set of processes that the exclude oracle of process p, at time t, believes
should be excluded. The parameter g is intended to represent the current set of
processes in the group. Thus, informally, an exclude oracle history represents a 
possible behavior of the exclude oracle. An exclude oracle XO is a function that 
maps a participation pattern PP to a set of exclude oracle histories XO{PP). 
This definition reflects the intuition that the behavior of the exclude oracle de- 
pends, in part, on the participation pattern. The participation pattern, however, 
does not completely determine the behavior of the exclude oracle^ — hence the 
exclude oracle maps the participation pattern to a set of possible behaviors. 

An include oracle history is a function $IOH : \Pi \times T \to 2^{\Pi}$. $IOH(p, t)$ represents
the set of processes that the include oracle of process p believes should
be included into the group at time t, and is precisely the set it will return if
it is queried by p at that time. An include oracle IO is a function that maps
a participation pattern PP to a set of include oracle histories $IO(PP)$. Note
that the parameter g used in the definition of an exclude oracle history does not
apply here: g is intended to represent the set of processes currently in the group;
it does not contain processes that are trying to join the group but have not yet 
succeeded in doing so. 

In order to circumvent the impossibility result of [8], we must place some 
restrictions on the behavior of the oracles. These restrictions are the analogues 
of the eventual completeness and accuracy properties of the eventually strong

^ Application semantics and inaccuracies are also reflected in the behavior of the 
exclude oracle. 

failure detector mentioned earlier, adapted to the setting of exclude oracles. We
say that an exclude oracle XO is an Eventually Strong XO if the following two 
properties are satisfied: 

Property 1 (XO Eventual Completeness). Eventually every persistently participating
process in the current group excludes forever every process in the current
group that isn’t persistently participating.

$\forall PP,\ \forall XOH \in XO(PP),\ \forall g \in 2^{\Pi},\ \exists t \in T,\ \forall t' > t,$
$\forall q \in g \cap (\Pi - Persist(PP)),\ \forall p \in g \cap Persist(PP),\ q \in XOH(p, t', g)$

Property 2 (XO Eventual Accuracy). If the current group contains a persistently
participating process then there is a time after which some persistently participating
process in the current group is never excluded by any persistently
participating process in the current group.

$\forall PP,\ \forall XOH \in XO(PP),\ \forall g \in 2^{\Pi},\ (g \cap Persist(PP) \neq \emptyset \Rightarrow \exists t \in T,$
$\exists q \in g \cap Persist(PP),\ \forall t' > t,\ \forall p \in g \cap Persist(PP),\ q \notin XOH(p, t', g))$
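
As an illustration of what Properties 1 and 2 demand, the sketch below checks them on a finite window of a single exclude oracle history, for one fixed group g. It is only a sketch under assumptions: the history is modelled as a Python callable `xoh(p, t, g)`, the set of persistently participating processes is supplied by the caller, and the function names are hypothetical. Because “eventually” cannot be decided on a finite window, each function takes a candidate stabilization time t and verifies the body of the property for every t' with t < t' ≤ horizon.

```python
from typing import Callable, FrozenSet, Set

# xoh(p, t, g): processes that p's exclude oracle wants excluded at time t,
# when queried about group g.
History = Callable[[str, int, FrozenSet[str]], Set[str]]


def complete_after(xoh: History, persist: Set[str], g: FrozenSet[str],
                   t: int, horizon: int) -> bool:
    """Body of Property 1: after t, every persistent member of g excludes
    every member of g that is not persistent."""
    live = g & persist
    gone = g - persist
    return all(gone <= xoh(p, u, g)
               for u in range(t + 1, horizon + 1) for p in live)


def accurate_after(xoh: History, persist: Set[str], g: FrozenSet[str],
                   t: int, horizon: int) -> bool:
    """Body of Property 2: if g has a persistent member, some persistent
    member of g is excluded by no persistent member of g after t."""
    live = g & persist
    if not live:
        return True
    return any(all(q not in xoh(p, u, g)
                   for u in range(t + 1, horizon + 1) for p in live)
               for q in live)
```

A history in which every process initially suspects all other members, and later excludes only the non-persistent ones, satisfies both checks from the time the suspicions stop, but fails the accuracy check if t is chosen too early.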

We define an include oracle IO as an Eventually Strong IO if the following
properties are satisfied:

Property 3 (IO Eventual Completeness). Eventually every persistently participating
process permanently includes every persistently participating process.

$\forall PP,\ \forall IOH \in IO(PP),\ \exists t \in T,\ \forall t' > t,$
$\forall q \in Persist(PP),\ \forall p \in Persist(PP),\ q \in IOH(p, t')$

Property 4 (IO Eventual Accuracy). Eventually no process includes a process
that is not persistently participating.

$\forall PP,\ \forall IOH \in IO(PP),\ \exists t \in T,\ \forall t' > t,$
$\forall q \in \Pi - Persist(PP),\ \forall p \in \Pi,\ q \notin IOH(p, t')$

2.2 Views 

The term “view” is traditionally used to refer to the current membership of 
the group (at some process). It is appealing to define a view simply as a set 
of processes, but upon reflection this turns out to be inadequate. To see why, 
consider the following situation: Suppose that the membership of the group is
originally {p, q}. A process r is then added, so that the membership becomes
{p, q, r}. Process r is then removed, and the membership becomes {p, q} again.
We would like to distinguish the initial view from the final view. Yet, considered 
as sets, the two views are identical. 

This example leads us to define views as identifiers in some set (for the sake
of concreteness, $\mathbb{Z}^{+}$). We assume that a function can “decode” each identifier v
by mapping it into the set of processes that makes up the group’s membership
when the view is represented by v. Formally, a view is an identifier from the
set $\mathcal{V} = \mathbb{Z}^{+} \cup \{nil, fin\}$. The special view nil (the null view) represents that
no view has been installed, and the special view fin (the final view) represents
the end of a process’s participation in the group. We define $\mathcal{V}' = \mathcal{V} - \{nil, fin\}$.
The “decoding” function is $Members : \mathcal{V} \to 2^{\Pi}$, where $Members(v)$
is the set of processes in the view $v \in \mathcal{V}$. We assume that $Members(nil) =
Members(fin) = \emptyset$. Note that, in general, it is possible to have two views
$v, v' \in \mathcal{V}$ such that $v \neq v'$ but $Members(v) = Members(v')$. If $p \in Members(v)$,
we say that view v contains process p.
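
The distinction between a view identifier and its membership can be sketched in Python as follows. The `ViewTable` class, its method names, and the choice of consecutive integers as identifiers are assumptions made for this illustration; the model only requires that identifiers come from the positive integers and that a decoding function exists.

```python
from typing import Dict, FrozenSet, Iterable, Union

NIL, FIN = "nil", "fin"          # the two special views
View = Union[int, str]           # a positive integer, or NIL / FIN


class ViewTable:
    """Hypothetical bookkeeping that assigns fresh identifiers to views."""

    def __init__(self) -> None:
        self._members: Dict[int, FrozenSet[str]] = {}
        self._next = 1

    def new_view(self, members: Iterable[str]) -> int:
        """Register a membership under a fresh positive-integer identifier."""
        v = self._next
        self._next += 1
        self._members[v] = frozenset(members)
        return v

    def members(self, v: View) -> FrozenSet[str]:
        """The decoding function: Members(nil) = Members(fin) = empty set."""
        if v in (NIL, FIN):
            return frozenset()
        return self._members[v]


table = ViewTable()
v1 = table.new_view({"p", "q"})        # initial view
v2 = table.new_view({"p", "q", "r"})   # r is added
v3 = table.new_view({"p", "q"})        # r is removed again
```

The initial and final views, `v1` and `v3`, decode to the same set of processes but remain distinct identifiers, which is exactly what the set-based definition could not express.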

We assume that the state of a process includes information about the most 
recently installed view. We define ViewState(a) to be the view corresponding 
to state a. ViewState(a) is nil if no view from $\mathcal{V}'$ has been installed and fin if
the process is no longer participating after having installed a view from $\mathcal{V}'$.

2.3 Computations 

A configuration is a pair $(s, M)$ where s is a function that maps a process p to
its local state and M is a set of messages (intuitively, the set of messages sent 
and not yet received). An algorithm is a collection of deterministic automata, 
one per process. In a single step of a given algorithm, a process p may perform 
one of the following actions: 

• $(send, p, m)$ send the message m;

• $(recv, p, m)$ receive the message m;

• $(excl, p, g, \gamma)$ query the exclude oracle of process p about set g with result $\gamma$;

• $(incl, p, \delta)$ query the include oracle of process p with result $\delta$;

• $(change, p, a)$ change the state of process p to a.

If a process p has state a then a step e by p is applicable in configuration
$C = (s, M)$ if p’s automaton stipulates that the step e can be taken in state a
and, in addition, $m \in M$ for a receive step $e = (recv, p, m)$. $e(C)$ denotes the
configuration resulting from the application of step e to C.

A schedule S of an algorithm A is a sequence of steps of A. $S[i]$ is the $i$-th step
of the sequence. A schedule S of an algorithm A is applicable to a configuration C
if and only if (a) S is the empty schedule, or (b) $S[1]$ is applicable to C and $S[2]$
is applicable to $S[1](C)$ and $S[3]$ is applicable to $S[2](S[1](C))$, etc.

An initial configuration $I = (s, M)$ is valid if $M = \emptyset$ and there exists a
unique $v^* \in \mathcal{V}'$ such that for each process p the state function s provides:
$ViewState(s(p)) = v^*$ if $p \in Members(v^*)$ and $ViewState(s(p)) = nil$ otherwise.
We define $Input(I)$ to be $v^*$.

Let A be an algorithm, XO be an exclude oracle, and IO be an include
oracle. A run of A with XO and IO is a tuple $R = (PP, XOH, IOH, I, S, T)$
where PP is a participation pattern, $XOH \in XO(PP)$ is an exclude oracle
history, $IOH \in IO(PP)$ is an include oracle history, I is an initial configuration
of A, S is an infinite schedule of A that is applicable to I, and T is an
infinite sequence of increasing time values (indicating when each step in S occurred
according to the fictional global clock). We define $ViewRun(p, i, R)$ to be
$ViewState(s(p))$, where s is the process state function of the configuration after
the first i steps of schedule S starting from initial configuration I in the run R;
we define $ViewSet(p, i, R) = Members(ViewRun(p, i, R))$.

A run R is well-formed if $\forall i \in \mathbb{Z}^{+}$ and $\forall p \in \Pi$ it satisfies the following
properties:

• I is a valid initial configuration;

• if $S[i] = (excl, p, g, \gamma)$ then $g = ViewSet(p, i, R)$, and $\gamma = XOH(p, T[i], g)$;

• if $S[i] = (incl, p, \delta)$ then $\delta = IOH(p, T[i])$;

• if $p \in Persist(PP)$ then p takes an infinite number of steps in S;

• every message sent to a process $p \in Persist(PP)$ is eventually received;

• if $p \notin PP(T[i])$ then $XOH(p, T[i], g) = \emptyset$ for any g;

• if $p \notin PP(T[i])$ then $IOH(p, T[i]) = \emptyset$;

• if $p \notin PP(T[i])$ and $\exists i' < i$ such that $ViewRun(p, i', R) \in \mathcal{V}'$ then
$\forall i'' > i$, $ViewRun(p, i'', R) = fin$.

From now on, we consider only well-formed runs. 

3 The Specification 

In this section, we give a specification that captures the basic notions of Group 
Membership. We say that algorithm A solves Group Membership (GM) with 
oracles XO and IO if every run $R = (PP, XOH, IOH, I, S, T)$ of A satisfies the
safety properties and liveness properties described in this section. 

3.1 Safety Properties 

The first three safety properties, which we call “validity properties”, require that
the group’s membership does not change unless there is a good reason. First 
we explain the motivation for such properties. Some previously proposed spec- 
ifications for group membership allow two kinds of undesirable behavior [4]: 
Capricious removal causes processes to expel others from the group for no reason;
collective suicide causes processes to leave the group for no reason. Such
behavior is undesirable because it allows for trivial (and useless) implementa- 
tions of the group membership service, which force the group to be the empty 
set for no reason! Our specification avoids this pitfall by insisting that processes 
be removed from or added to the group only if the include or exclude oracles jus- 
tify the change. Notice that unlike other specifications, in which changes to the 
membership are caused by processes executing “join” and “leave” events which 
are under their own control, the output of the include and exclude oracles is not 
under the process’ control. More precisely, we require the following properties: 

Property 5 (Exclude Validity). If p installs view $v \in \mathcal{V}$ and, later, view $v' \in \mathcal{V}$,
where v contains q and v' does not contain q, then q was previously in the output
of some exclude oracle.

Property 6 (Final View Validity). If p installs fin then p was previously in the 
output of some exclude oracle or p is not persistently participating. 

* As mentioned in the discussion leading up to the definition of an exclude oracle,
the exclude oracle may only be queried about the current group.

Property 7 (Include Validity). If p installs view $v \in \mathcal{V}$ where $v \neq Input(I)$ and v
contains q then q was in the output of some include oracle.

The validity properties require that the group’s membership should conform 
to the information given by the oracles. Ideally, one would like the members of 
the group at any time to reflect the participation pattern. However, the include 
and exclude oracles are the only means that processes have at their disposal to 
“sense” the participation pattern. Thus, if these oracles provide very inaccurate 
information about the participation pattern, the membership of the group may 
be far from the ideal. This is not a weakness of our specification. On the contrary, 
we believe that it is an advantage of our approach that it can meaningfully 
capture situations where the information about the state of the group is highly 
inaccurate due to unanticipated delays, inappropriately short settings of timeout 
timers or a variety of other reasons that can arise in practice. 

The next two safety properties, which we call “integrity properties”, formalize
the semantics of the special views nil and fin.

Property 8 (Null View Integrity). If p installs a view $v \neq nil$ then p cannot
subsequently install nil.

Property 9 (Final View Integrity). If p installs fin then p does not subsequently
install another view.

Final View Integrity prohibits a process from leaving a group and then joining 
it again. In practice, the same “computational entity” can repeatedly leave and 
rejoin the group under a different name from $\Pi$.

The last safety property, which we call “agreement”, stipulates that the sequences
of views installed by the different processes fit together in a global sense:
there is a single sequence of views such that each process installs a sequence of
views that is a “window” to this one sequence.

Property 10 (Agreement). There exists a sequence $G = (v_0, v_1, v_2, \ldots)$ of distinct
views $v_i$ such that $v_0 = Input(I)$ and the sequence of views from $\mathcal{V}$ installed
by any process p is a contiguous subsequence of G.
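The Agreement property can be checked mechanically on a finite prefix of a run. The following is a minimal Python sketch under stated assumptions: views are represented as hashable labels, the special views nil and fin have already been filtered out of each process's sequence, and the names `is_contiguous_subsequence` and `satisfies_agreement` are hypothetical helpers that do not appear in the paper.

```python
from typing import Dict, Hashable, Sequence


def is_contiguous_subsequence(window: Sequence[Hashable],
                              g: Sequence[Hashable]) -> bool:
    """Return True if `window` occurs in `g` as one unbroken run."""
    k = len(window)
    if k == 0:
        return True
    return any(list(g[i:i + k]) == list(window)
               for i in range(len(g) - k + 1))


def satisfies_agreement(installed: Dict[str, Sequence[Hashable]],
                        g: Sequence[Hashable]) -> bool:
    """Check a candidate global sequence g against per-process view sequences."""
    if len(set(g)) != len(g):  # the views in G must be distinct
        return False
    return all(is_contiguous_subsequence(seq, g) for seq in installed.values())
```

For instance, a process that installs v0 and then v2 without v1 fails the check, because its sequence is not a window onto G.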

3.2 Liveness Properties 

The validity properties state that the membership does not change without good 
reason. Conversely, the liveness properties state that if a compelling reason does 
exist then the membership of the group will change. We say that a property P 
holds almost always in a run if there is a time after which the property P always 
holds. 

Property 11 (Exclude Progress). Let p be a persistently participating process
that installs some view. If almost always p excludes process q and almost always
all processes do not include q then almost always p’s view does not contain q.

Property 12 (Include Progress). Let p be a persistently participating process
that installs some view. If almost always p includes q and q never installs fin
then almost always p’s view contains q.

Property 11 states that if there is good reason to exclude a process p from 
the group and no good reasons to include p in the group then p must eventually 
be dropped from the group. Property 12 states a similar progress requirement 
for inclusion into the group. 

Property 13 (Quiescent Progress). If almost always all processes do not exclude q
and almost always all processes do not include q then almost always all views 
contain q or almost always no view contains q. 

Property 13 states that the views eventually become stable with respect to
processes that the oracles are silent about. Suppose the current view at
process p contains {p, q, r}. Process r stops participating, p’s oracle successfully
requests the removal of r, and the next view at p contains {p, q}. All the oracles
then give imperfect information and request the inclusion of r. Our specification
allows the next view at p to contain {p, q, r}. As long as the oracles are colluding to
exclude and then include r, our specification allows this to happen repeatedly.
When the oracles become silent with respect to r, Property 13 requires that the
views become stable with respect to r.

Property 14 (Include Notification). If p installs $v \in \mathcal{V}$ and v contains a
persistently participating process q, then q eventually installs v.

Property 14 requires that processes that have been included into the group
actually participate in the group.

Several of the properties in our specification refer to the behavior of the 
oracles. This is a departure from the way in which Consensus is specified in 
the analogous situation where we have failure detectors rather than oracles. 
There, the specification makes reference to the failure pattern (the analogue to 
our participation pattern) but not to the behavior of the failure detectors (the 
analogue to our oracles). Why not specify Group Membership with reference 
only to the participation pattern? The reason is the difference in the nature 
of the two problems. In Group Membership processes must compute something 
(the new view) that is related to the participation pattern. Their ability to do so 
depends on the oracles, which are the only means processes have for observing the 
participation pattern. It is therefore natural that the problem specification refers 
to the oracles. In contrast, what processes compute in Consensus is not in any 
way related to the failure pattern. It therefore stands to reason that the problem 
should be specifiable without reference to the failure detectors, which are the 
only means the processes have for observing the failure pattern. By requiring 
stronger properties of the oracles it is possible to limit and even to completely 
eliminate the reference to the oracles in the specification of Group Membership. 
The resulting specification, however, is meaningful only for restricted classes of 
oracles. We felt that the loss of generality is too high a price to pay for the gain
in specification simplicity.

4 The Algorithm

In our model with an infinite number of processes and a computation that is
“long-lived”, it is possible that an infinite number of processes may stop
participating in the computation. However, if too many processes stop participating
simultaneously, our algorithm will not be able to make progress. Thus we require a
minimal level of participation to satisfy the liveness part of the specification. The
safety properties are never violated. In the case of our group membership
algorithm, we require a majority of processes in any view to be currently in the
participation pattern. Formally, $\forall i \in \mathbb{Z}^+$, $\forall p \in \Pi$:
$(PP(T[i]) \neq \emptyset \wedge ViewSet(p, i, R) \neq \emptyset) \Rightarrow |ViewSet(p, i, R) \cap PP(T[i])| > |ViewSet(p, i, R)|/2$.

The algorithm assumes that the exclude oracle is an eventually strong XO
(Properties 1 and 2) and that the include oracle is an eventually strong IO
(Properties 3 and 4).

In our pseudo-code, we use the notation xvar ← XO to indicate that variable
xvar gets the value of the local exclude oracle when queried about the set of
processes corresponding to the current view. The notation ivar ← IO indicates
that the variable ivar gets the value of the local include oracle.

We assume all messages that arrive at a process are kept in a local message 
pool. The pool can be searched by any task of a process for messages satisfying 
certain criteria. We assume that all messages have a tuple form. We define a
retrieve primitive to extract messages from the pool.

The retrieve primitive takes as an argument a tuple consisting of two types
of parameters: “extract” parameters and “pattern” parameters. Parameters that
are preceded by a ? are extract parameters and parameters that are not
preceded by a ? are pattern parameters. $retrieve(a_1, a_2, \ldots, a_n)$ returns false if
there is no message in the pool that consists of n parameters where each of the
pattern parameters in the argument to retrieve matches in position and value
with the parameters of the message in the pool. $retrieve(a_1, a_2, \ldots, a_n)$ returns
true if there is such a message. Further, each of the extract parameters in the
argument to retrieve is assigned the value of the corresponding parameter in
the message. If more than one such message exists in the pool, one is selected
arbitrarily.
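To make the matching rule concrete, here is a minimal Python sketch of a message pool with a retrieve operation. It is an illustration under assumptions, not the paper's implementation: `MessagePool` and `Extract` are hypothetical names, extracted values are returned in a dictionary rather than assigned to variables, the first matching message stands in for the arbitrary choice, and the message is left in the pool (the text does not say whether retrieve removes it).

```python
class Extract:
    """Marks an extract parameter (written ?name in the pseudo-code)."""

    def __init__(self, name):
        self.name = name


class MessagePool:
    """Local pool holding every message that has arrived at a process."""

    def __init__(self):
        self.messages = []

    def add(self, *msg):
        self.messages.append(tuple(msg))

    def retrieve(self, *args):
        """Return (True, bindings) if some message matches, else (False, {})."""
        for msg in self.messages:
            if len(msg) != len(args):
                continue  # the message must have exactly n parameters
            # Pattern parameters must match in position and value.
            if all(isinstance(a, Extract) or a == m
                   for a, m in zip(args, msg)):
                bindings = {a.name: m for a, m in zip(args, msg)
                            if isinstance(a, Extract)}
                return True, bindings
        return False, {}
```

A call such as `pool.retrieve("J", Extract("v"), Extract("dropped"))` corresponds to retrieve(J, ?v, ?dropped) in the pseudo-code.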

4.1 View Consensus 

At the heart of our GM algorithm lies a variant of the well-known Consensus 
problem, that we call View Consensus (VC). In this section, we define the prob- 
lem and give an algorithm that solves it. VC is defined in terms of two primitives, 
VC-propose(v, v') and VC-decide(v, v''), where v, v' and v'' are views. Intuitively,
a process invokes VC-propose(v, v') to propose v' as a possible view to succeed v.
If a process executes VC-decide(v, v'') it has decided on v'' as the view to
succeed v. We use the notation VC-propose(v, -) and VC-decide(v, -) to indicate
that the second parameter can be any view. VC is specified by the following
properties:

Property 15 (VC-Uniform Agreement). If p executes VC-decide(v, v') and q
executes VC-decide(v, v'') then v' = v''.

Property 16 (VC-Termination). If a process p executes VC-propose(v, -) where
$p \in Members(v)$ and p does not stop participating while in the view v then p
executes VC-decide(v, -).



Property 17 (VC-Uniform Integrity). Every process that executes
VC-decide(v, -) does it at most once.

Property 18 (VC-Uniform Validity). If p executes VC-decide(v, v') then some process
previously executed VC-propose(v, v').

In Figure 1 we use the rotating coordinator algorithm of Chandra and Toueg 
to implement VC. 

4.2 Group Membership Service 

In Figure 2, we give a protocol to satisfy the specification for GM. The protocol
runs through three phases. Phase A is only applicable to processes that are not 
part of the initial group. These processes wait until a message of type J arrives 
with the first view to be installed. Once this message is retrieved, processes 
proceed to phase B. Processes in the initial group proceed directly to phase B. 

In phase B of the protocol, each process tries to build, in the variable tryset,
a set of processes to propose as the next view. Each process does this by querying 
its local include and exclude oracles. Any process that is successful in building 
a new tryset sends its tryset to all members of the current view. By the end of 
phase B, each member of the current view has some tryset (whether it was sent 
by some other member of the current view or successfully built by querying the 
oracles) to propose for View Consensus. 
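The set arithmetic used to build a tryset in phase B (the joins, quits and tryset assignments of Figure 2) can be sketched in a few lines of Python. This is only an illustration: `build_tryset` is a hypothetical helper, and the oracle outputs are passed in as plain sets rather than obtained by querying XO and IO.

```python
def build_tryset(members, include_out, exclude_out, dropped):
    """One pass of the phase B computation for a single process.

    members      -- Members(v), the processes in the current view
    include_out  -- current output of the local include oracle (IO)
    exclude_out  -- current output of the local exclude oracle (XO)
    dropped      -- processes already removed from the group
    """
    joins = include_out - dropped - members          # processes to add
    quits = (exclude_out - include_out) & members    # processes to drop
    return (members | joins) - quits
```

A process that the include and exclude oracles both name is not dropped, since it is removed from quits, and a process in dropped is never added back.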

Phase C begins with View Consensus for the current view with the proposal 
created in phase B. The process then waits for the resulting new view from View 
Consensus. If the current process is a member of the new view then the process 
installs the new view. Otherwise, the process installs fin. We keep track of all 
processes that have been removed from the group in the variable dropped. These 
processes are prevented from being included again during phase B. At the end 
of phase C, messages of type J are sent to any new processes joining the group 
(which are still in phase A). Finally, the processes that have not installed fin at 
the end of phase C, begin again at phase B. 

The protocol satisfies some common notions of Group Membership that are
not part of the GM specification. For example, the protocol satisfies Self
Inclusion,* which is not required by the specification. Our model assumes that each
of the views installed contains a majority of participating processes. A natural
modification to the algorithm is to ensure the protocol only proposes trysets of
sufficient size to accurately reflect the assumptions of the model.

* If p installs a view $v \in \mathcal{V}$ then $p \in Members(v)$.

Every process p executes the following:

cobegin
  procedure VC-propose(v, v')
    n ← |Members(v)|                         ▷ no. of processes in the view
    est ← v'                                 ▷ the current estimate
    ts_p ← 0                                 ▷ timestamp est was adopted
    rnd ← 0                                  ▷ the current round
    while p has not executed VC-decide(v, -) do
      rnd ← rnd + 1
      coord ← the ((rnd mod n) + 1)-th member of Members(v)
      send (P1, v, p, rnd, est, ts_p) to coord          ▷ PHASE 1: send estimates
      if p = coord then                                 ▷ PHASE 2: gather estimates
        wait until [any ⌈(n+1)/2⌉ processes q: retrieve
                    (P1, v, q, rnd, ?rep, ?reptsp)]
        msgs[rnd] ← the set of messages retrieved above
        t ← max reptsp : (-, -, -, -, -, reptsp) ∈ msgs[rnd]
        est ← select one rep : (-, -, -, -, rep, t) ∈ msgs[rnd]
        send (P2, v, coord, rnd, est) to Members(v)
      exc ← ∅                                           ▷ PHASE 3: adopt an estimate
      wait until [retrieve (P2, v, coord, rnd, ?newest) or coord ∈ (exc ← XO)]
      if coord ∉ exc then
        est ← newest
        ts_p ← rnd
        send (P3, v, p, rnd, ACK) to coord
      else
        send (P3, v, p, rnd, NAK) to coord
      if p = coord then                                 ▷ PHASE 4: count acks
        wait until [any ⌈(n+1)/2⌉ processes q: retrieve (P3, v, q, rnd, ACK) or
                    (P3, v, q, rnd, NAK)]
        if ⌈(n+1)/2⌉ ACK messages retrieved above then
          send (D, v, p, est) to p

  task VC-decide
    while true do
      if retrieve (D, ?v, ?q, ?newview) then
        if p has not executed VC-decide(v, -) then
          send (D, v, q, newview) to Members(v)
          VC-decide(v, newview)
coend

Fig. 1. View Consensus



Every process p executes the following:

v ← nil                                      ▷ the current view
if p is in the initial view then
  v ← initial view
dropped ← ∅                                  ▷ all processes removed
while v = nil do                             ▷ PHASE A: first view
  if retrieve (J, ?v, ?dropped) then
    install(v)
while v ≠ fin do                             ▷ PHASE B: build tryset
  tryset ← Members(v)                        ▷ set possible in next view
  while tryset = Members(v) do
    joins ← IO − dropped − Members(v)        ▷ the set of processes to add
    quits ← (XO − IO) ∩ Members(v)           ▷ the set of processes to drop
    tryset ← (Members(v) ∪ joins) − quits
    if retrieve (T, v, ?sometryset) then
      tryset ← sometryset
  send (T, v, tryset) to Members(v)
  v' ← view encoding such that Members(v') = tryset and v' > v
  VC-propose(v, v')                          ▷ PHASE C: view consensus
  wait until [some newv: p has executed VC-decide(v, newv)]
  oldv ← v
  if p ∈ Members(newv) then
    v ← newv
  else
    v ← fin
  install(v)
  dropped ← dropped ∪ (Members(oldv) − Members(newv))
  send (J, newv, dropped) to Members(newv) − Members(oldv)

Fig. 2. Group Membership Service

5 Concluding Remarks

By using oracles, we have abstracted, into well-defined modules, the reasons a
process might be included into or excluded from a group by a group membership
service. Further research is needed into more refined specifications of oracles
for specific applications, as well as implementations of such application-specific
oracles.

In this paper, we did not discuss group communication primitives (multi- 
casts). This can be an extensive topic with a wide variety of interesting ordering 
and delivery requirements. We believe this group membership specification will 
provide a good basis upon which to study multicasts and are currently working 
on the issue. 

In our model, we assume reliable links. We plan to investigate the work of
Aguilera, Chen and Toueg [1] on intermittent connectivity with respect to our
model and algorithms.

Abstract. Message ordering is a fundamental abstraction in distributed
systems. However, usual ordering guarantees are purely “syntactic”, that
is, message “semantics” is not taken into consideration, despite the fact
that in several cases, semantic information about messages leads to more 
efficient message ordering protocols. In this paper we define the Generic 
Broadcast problem, which orders the delivery of messages only if needed, 
based on the semantics of the messages. Semantic information about 
the messages is introduced in the system by a conflict relation defined 
over messages. We show that Reliable and Atomic Broadcast are spe- 
cial cases of Generic Broadcast, and propose an algorithm that solves 
Generic Broadcast efficiently. In order to assess efficiency, we introduce 
the concept of delivery latency. 



1 Introduction 

Message ordering is a fundamental abstraction in distributed systems. Total or- 
der, causal order, view synchrony, etc., are examples of widely used ordering 
guarantees. However, these ordering guarantees are purely “syntactic” in the 
sense that they do not take into account the “semantics” of the messages. Ac- 
tive replication for example (also called state machine approach [12]), relies on 
total order delivery of messages on the active replicated servers. By considering 
the semantics of the messages sent to active replicated servers, total order de- 
livery may not always be needed. This is the case for example if we distinguish 
read messages from write messages sent to active replicated servers, since read 
messages do not need to be ordered with respect to other read messages. As 
message ordering has a cost, it makes sense to avoid ordering messages when not 
required. 

In this paper we define the Generic Broadcast problem (defined by the
primitives g-Broadcast and g-Deliver), which establishes a partial order on message
delivery. Semantic information about messages is introduced in the system by a
conflict relation defined over the set of messages. Roughly speaking, two
messages m and m' have to be g-Delivered in the same order only if m and m' are
conflicting messages. The definition of message ordering based on a conflict
relation allows for a very powerful message ordering abstraction. For example, the
Reliable Broadcast problem is an instance of the Generic Broadcast problem in
which the conflict relation is empty. The Atomic Broadcast problem is another
instance of the Generic Broadcast problem, in which all pairs of messages conflict.

Any algorithm that solves Atomic Broadcast trivially solves any instance of 
Generic Broadcast (i.e., specified by a given conflict relation), by ordering more 
messages than necessary. Thus, we define a Generic Broadcast algorithm to be 
strict if it only orders messages when necessary. The notion of strictness captures 
the intuitive idea that total order delivery of messages has a cost, and this cost 
should only be paid when necessary. 

In order to assess the cost of Generic Broadcast algorithms, we introduce the 
concept of delivery latency of a message. Roughly speaking, the delivery latency 
of a message m is the number of communication steps between g-Broadcast(m)
and g-Deliver(m). We then give a strict Generic Broadcast algorithm that is
less expensive than known Atomic Broadcast algorithms, that is, in runs where
messages do not conflict, our algorithm ensures that the delivery latency of every
message is always equal to 2 (known Atomic Broadcast algorithms have a
delivery latency of at least 3).

The rest of the paper is structured as follows. Section 2 defines the Generic 
Broadcast problem. Section 3 defines the system model and introduces the con- 
cept of delivery latency. Section 4 presents a solution to the Generic Broadcast 
problem. Section 5 discusses related work, and Section 6 concludes the paper. 

2 Generic Broadcast 

2.1 Problem Definition 

Generic Broadcast is defined by the primitives g-Broadcast and g-Deliver.¹ When
a process p invokes g-Broadcast with a message m, we say that p g-Broadcasts m,
and when p returns from the execution of g-Deliver with message m, we say that p
g-Delivers m. Message m is taken from a set $\mathcal{M}$ to which all messages belong.
Central to Generic Broadcast is the definition of a (symmetric) conflict relation
on $\mathcal{M} \times \mathcal{M}$ denoted by $\mathcal{C}$ (i.e., $\mathcal{C} \subseteq \mathcal{M} \times \mathcal{M}$). If $(m, m') \in \mathcal{C}$ then we say that m
and m' conflict. Generic Broadcast is specified by (1) a conflict relation $\mathcal{C}$ and
(2) the following conditions:

gB-1 (Validity) If a correct process g-Broadcasts a message m, then it even- 
tually g-Delivers m. 

gB-2 (Agreement) If a correct process g-Delivers a message m, then all correct 
processes eventually g-Deliver m. 

gB-3 (Integrity) For any message m, every correct process g-Delivers m at 
most once, and only if m was previously g-Broadcast by some process. 



¹ g-Broadcast has no relation to the GBCAST primitive defined in the Isis
system [1].

gB-4 (Partial Order) If correct processes p and q both g-Deliver messages m
and m', and m and m' conflict, then p g-Delivers m before m' if and only
if q g-Delivers m before m'.

The conflict relation $\mathcal{C}$ determines the pairs of messages that are sensitive to
order, that is, the pairs of messages for which the g-Deliver order should be the
same at all processes that g-Deliver the messages. The conflict relation $\mathcal{C}$ renders
the above specification generic, as shown in the next section.
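The partial order condition gB-4 can be tested on two recorded delivery sequences. The sketch below is illustrative only: `same_relative_order` is a hypothetical helper, the conflict relation is supplied as a symmetric Python predicate, and each message is assumed to appear at most once per sequence (as gB-3 requires).

```python
from itertools import combinations


def same_relative_order(seq_p, seq_q, conflict):
    """Return True if the two delivery sequences agree on every conflicting pair.

    seq_p, seq_q -- messages in the order processes p and q g-Delivered them
    conflict     -- symmetric predicate: conflict(m, m2) is True if (m, m2) is in C
    """
    pos_p = {m: i for i, m in enumerate(seq_p)}
    pos_q = {m: i for i, m in enumerate(seq_q)}
    common = [m for m in seq_p if m in pos_q]  # delivered by both processes
    for m1, m2 in combinations(common, 2):
        if conflict(m1, m2):
            if (pos_p[m1] < pos_p[m2]) != (pos_q[m1] < pos_q[m2]):
                return False
    return True
```

With an always-false predicate the check accepts any pair of sequences, and with an always-true predicate it demands identical order on the common messages, which matches the two extremes discussed in the next section.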



2.2 Reliable and Atomic Broadcast as Instances of Generic 
Broadcast 

We consider in the following two special cases of conflict relations: (1) the empty
conflict relation, denoted by $\mathcal{C}_\emptyset$, where $\mathcal{C}_\emptyset = \emptyset$, and (2) the $\mathcal{M} \times \mathcal{M}$ conflict
relation, denoted by $\mathcal{C}_{\mathcal{M} \times \mathcal{M}}$, where $\mathcal{C}_{\mathcal{M} \times \mathcal{M}} = \mathcal{M} \times \mathcal{M}$. In case (1) no pair of
messages conflict, that is, the partial order property gB-4 imposes no constraint.
This is equivalent to having only the conditions gB-1, gB-2 and gB-3, which is
called Reliable Broadcast [4]. In case (2) any pair (m, m') of messages conflict,
that is, the partial order property gB-4 imposes that all pairs of messages be
ordered, which is called Atomic Broadcast [4]. In other words, Reliable Broadcast
and Atomic Broadcast lie at the two ends of the spectrum defined by Generic
Broadcast. In between, any other conflict relation defines an instance of Generic
Broadcast.

Conflict relations lying in between the two extremes of the conflict spectrum
can be better illustrated by an example. Consider a replicated Account object,
defined by the operations deposit(x) and withdraw(x). Clearly, deposit operations
commute with each other, while withdraw operations do not, neither with each
other nor with deposit operations.² Let $\mathcal{M}_{deposit}$ denote the set of messages
that carry a deposit operation, and $\mathcal{M}_{withdraw}$ the set of messages that carry a
withdraw operation. This leads to the following conflict relation $\mathcal{C}_{Account}$:

$\mathcal{C}_{Account} = \{ (m, m') : m \in \mathcal{M}_{withdraw} \text{ or } m' \in \mathcal{M}_{withdraw} \}$.

Generic Broadcast with the $\mathcal{C}_{Account}$ conflict relation for broadcasting the
invocation of deposit and withdraw operations to the replicated Account object defines
a weaker ordering primitive than Atomic Broadcast (e.g., messages in $\mathcal{M}_{deposit}$
are not required to be ordered with each other), and a stronger ordering primitive
than Reliable Broadcast (which imposes no order at all).
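The three conflict relations discussed in this section can be written as predicates over messages. The following Python sketch is an illustration under assumptions: the `Op` message type and the function names are hypothetical, and a message is reduced to the operation it carries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """A message carrying one Account operation (illustrative type)."""
    kind: str    # "deposit" or "withdraw"
    amount: int


def conflicts_empty(m: Op, m2: Op) -> bool:
    """Empty relation: no pair conflicts (Reliable Broadcast)."""
    return False


def conflicts_all(m: Op, m2: Op) -> bool:
    """Full relation: every pair conflicts (Atomic Broadcast)."""
    return True


def conflicts_account(m: Op, m2: Op) -> bool:
    """Account relation: a pair conflicts if either message is a withdraw."""
    return m.kind == "withdraw" or m2.kind == "withdraw"
```

Two deposits never conflict under `conflicts_account`, so their relative delivery order may differ between replicas, while any pair involving a withdraw must be ordered identically everywhere.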



2.3 Strict Generic Broadcast Algorithm 

From the specification it is obvious that any algorithm solving Atomic Broadcast
also solves any instance of the Generic Broadcast problem defined by
$\mathcal{C} \subseteq \mathcal{M} \times \mathcal{M}$. However, such a solution also orders messages that do not conflict.

² This is the case for instance if we consider that a withdraw(x) operation can only be
performed if the current balance is larger than or equal to x.

We are interested in a strict algorithm, that is, an algorithm that does not order
two messages if not required, according to the conflict relation $\mathcal{C}$. The idea is
that ordering messages has a cost (in terms of number of messages, number of
communication steps, etc.) and this cost should be kept as low as possible. More
formally, we define an algorithm that solves Generic Broadcast for a conflict
relation $\mathcal{C} \subseteq \mathcal{M} \times \mathcal{M}$, denoted by $A_\mathcal{C}$, strict if it satisfies the condition below.

(Strictness) Consider an algorithm $A_\mathcal{C}$, and let $\mathcal{R}_\mathcal{C}$ be the set of runs
of $A_\mathcal{C}$. There exists a run R in $\mathcal{R}_\mathcal{C}$, in which at least two correct processes
g-Deliver two non-conflicting messages m and m' in a different order.

Informally, the strictness condition requires that algorithm $A_\mathcal{C}$ allow runs in
which the g-Deliver of non-conflicting messages is not totally ordered. However,
even if $A_\mathcal{C}$ does not order messages, it can happen that total order is
spontaneously ensured. So we cannot require violation of total order to be observed in
every run: we require it in at least one run of $A_\mathcal{C}$.

3 System Model and Definitions 

3.1 Processes, Failures and Failure Detectors 

We consider an asynchronous system composed of n processes $\Pi = \{p_1, \ldots, p_n\}$.
Processes communicate by message passing. A process can only fail by crashing
(i.e., we do not consider Byzantine failures). Processes are connected through
reliable channels, defined by the two primitives send(m) and receive(m). We
assume that the asynchronous system is augmented with failure detectors that allow
Consensus to be solved (e.g., the class of failure detectors $\Diamond\mathcal{S}$ allows Consensus to be
solved if the maximum number of failures is smaller than n/2) [2].



3.2 Delivery Latency 

In the following, we introduce the delivery latency as a parameter to measure the
efficiency of algorithms solving a Broadcast problem (defined by the primitives
$\alpha$-Broadcast and $\alpha$-Deliver). The delivery latency is a variation of the Latency
Degree introduced in [11], which is based on modified Lamport clocks [7]:

- a send event and a local event on a process p do not modify p’s local clock,

- let ts(send(m)) be the timestamp of the send(m) event, and ts(m) the
timestamp carried by message m: $ts(m) \stackrel{def}{=} ts(send(m)) + 1$,

- the timestamp of receive(m) on a process p is the maximum between ts(m)
and p’s current clock value.

The delivery latency of a message m $\alpha$-Broadcast in a run R of an
algorithm A solving a Broadcast problem, denoted by $dl^R(m)$, is defined as the
difference between the largest timestamp of all $\alpha$-Deliver(m) events (at most
one per process) in run R and the timestamp of the $\alpha$-Broadcast(m) event in
run R.

Let $\pi^R(m)$ be the set of processes that $\alpha$-Deliver message m in run R, and
$\alpha$-Deliver$_p$(m) the $\alpha$-Deliver(m) event at process p. The delivery latency of m in
run R is formally defined as

$$dl^R(m) \stackrel{def}{=} \max_{p \in \pi^R(m)} \big( ts(\alpha\text{-Deliver}_p(m)) - ts(\alpha\text{-Broadcast}(m)) \big).$$



For example, consider a broadcast algorithm where a process p, wishing to
broadcast a message m, (1) sends m to all processes, (2) each process q on
receiving m sends an acknowledge message ACK(m) to all processes, and (3) as
soon as q receives $n_{ack}$ messages of the type ACK(m), q delivers m. Let R be a
run of this algorithm where only m is broadcast. We have $dl^R(m) = 2$.
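The value 2 in this example can be reproduced by applying the three clock rules step by step. The Python sketch below is a hand-rolled illustration, not the paper's code: `delivery_latency_ack_broadcast` is a hypothetical name, process 0 plays the role of the broadcaster, all clocks are assumed to start at 0, and each process is assumed to receive its ACK messages from the first n_ack processes.

```python
def delivery_latency_ack_broadcast(n: int, n_ack: int) -> int:
    """Delivery latency of m in a run where only m is broadcast."""
    clock = [0] * n                       # modified Lamport clocks, all start at 0
    ts_broadcast = clock[0]               # timestamp of the broadcast event at p
    ts_m = clock[0] + 1                   # ts(m) = ts(send(m)) + 1; send leaves clock as is

    for q in range(n):                    # step (1): every process receives m
        clock[q] = max(ts_m, clock[q])

    # step (2): each q sends ACK(m) to all; the ACK carries clock[q] + 1
    ack_ts = [clock[q] + 1 for q in range(n)]

    deliver_ts = []
    for q in range(n):                    # step (3): q delivers after n_ack ACKs
        for sender in range(n_ack):
            clock[q] = max(ack_ts[sender], clock[q])
        deliver_ts.append(clock[q])

    return max(deliver_ts) - ts_broadcast
```

Each of the two communication steps (m, then ACK(m)) raises the receiver's clock by one, which is where the latency of 2 comes from.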



4 Solving Generic Broadcast 

4.1 Overview of the Algorithm 

Processes executing our Generic Broadcast algorithm progress in a sequence of
stages numbered 1, 2, ..., k, .... Stage k terminates only if two conflicting messages
are g-Broadcast, but not g-Delivered in some stage k' < k.

g-Delivery of non-conflicting messages. Let m be a message that is g-Broadcast.
When some process p receives m in stage k, and m does not conflict with any
other message m' already received by p in stage k, then p inserts m in its
$pending_p^k$ set, and sends an ACK(m) message to all processes. As soon as p
receives ACK(m) messages from $n_{ack}$ processes, where

$$n_{ack} \geq (n + 1)/2, \qquad (1)$$

p g-Delivers m.



g-Delivery of conflicting messages. If a conflict is detected, Consensus is launched
to terminate stage k. The Consensus decides on two sets of messages, denoted
by $NCmsgSet^k$ (NC stands for Non-Conflicting) and $CmsgSet^k$ (C stands for
Conflicting). The set $NCmsgSet^k \cup CmsgSet^k$ is the set of all messages that
are g-Delivered in stage k. Messages in $NCmsgSet^k$ are g-Delivered before
messages in $CmsgSet^k$, and messages in $NCmsgSet^k$ may be g-Delivered by some
process p in stage k before p executes the k-th Consensus. The set $NCmsgSet^k$
does not contain conflicting messages, while messages in $CmsgSet^k$ may conflict.
Messages in $CmsgSet^k$ are g-Delivered in some deterministic order. Process p
starts stage k + 1 once it has g-Delivered all messages in $CmsgSet^k$.

Properties. To be correct, our algorithm must satisfy the following properties:

(a) If two messages m and m' conflict, then at most one of them is g-Delivered
in stage k before Consensus.

(b) If message m is g-Delivered in stage k by some process p before Consensus,
then m is in the set $NCmsgSet^k$.

(c) The set $NCmsgSet^k$ does not contain any conflicting messages.³

Property (a) is ensured by condition (1). Property (b) is ensured as follows.
Before starting Consensus, every process p sends its $pending_p^k$ set to all processes
(in a message of type checking, denoted by CHK), and waits for messages of
type CHK from exactly $n_{chk}$ processes. Only if some message m is in at least
$\lceil (n_{chk} + 1)/2 \rceil$ messages of type CHK is m inserted in $majMSet_p^k$, the
initial value of the Consensus that decides on $NCmsgSet^k$. So, if m is in fewer than
$\lceil (n_{chk} + 1)/2 \rceil$ messages of type CHK, m is not inserted in $majMSet_p^k$. Indeed,
if condition

$$2 n_{ack} + n_{chk} \geq 2n + 1 \qquad (2)$$

holds, then m could not have been g-Delivered in stage k before Consensus. To
understand why, notice that from (2), we have

$$(n - n_{chk}) + \lceil (n_{chk} + 1)/2 \rceil \leq n_{ack}, \qquad (3)$$

where $(n - n_{chk})$ is the number of processes from which p knows nothing.
From (3), if m is in fewer than $\lceil (n_{chk} + 1)/2 \rceil$ messages of type CHK, then even
if all processes from which p knows nothing had sent ACK(m), there would not
be enough ACK(m) messages to have m g-Delivered by some process in stage k
before Consensus.
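The step from condition (2) to inequality (3) can be sanity-checked by brute force. The sketch below reads (2) as 2·n_ack + n_chk ≥ 2n + 1 and (3) as (n − n_chk) + ⌈(n_chk + 1)/2⌉ ≤ n_ack, for integer parameters between 1 and n; `implication_holds` is a hypothetical helper and the search bound is arbitrary. It is an exhaustive test over small values, not a proof.

```python
def implication_holds(max_n: int = 30) -> bool:
    """Check that (2) implies (3) for all 1 <= n_ack, n_chk <= n <= max_n."""
    for n in range(1, max_n + 1):
        for n_ack in range(1, n + 1):
            for n_chk in range(1, n + 1):
                if 2 * n_ack + n_chk >= 2 * n + 1:          # condition (2)
                    majority = (n_chk + 2) // 2             # ceil((n_chk + 1) / 2)
                    if (n - n_chk) + majority > n_ack:      # (3) violated
                        return False
    return True
```

The underlying argument is short: (2) gives n_ack ≥ (n − n_chk) + (n_chk + 1)/2, and since n_ack is an integer the right-hand side can be rounded up.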

Property (c) is ensured by the fact that m is inserted in $majMSet_p^k$ only
if m is in at least $\lceil (n_{chk} + 1)/2 \rceil$ messages of type CHK received by p (majority
condition). Let m and m' be two messages in $majMSet_p^k$. By the majority
condition, the two messages are in the $pending_q^k$ set of at least one process q.
This is however only possible if m and m' do not conflict.



Minimal number of correct processes. Our Generic Broadcast algorithm waits
for $n_{ack}$ messages before g-Delivering non-conflicting messages, and $n_{chk}$
messages if a conflict is detected before starting Consensus. So our algorithm requires
$\max(n_{ack}, n_{chk})$ correct processes. The minimum of this expression happens to
be (2n + 1)/3, when $n_{ack} = n_{chk}$.
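The minimum can be confirmed for concrete system sizes by enumerating all integer choices of the two thresholds. In this sketch the constraints are read as n_ack ≥ (n + 1)/2 and 2·n_ack + n_chk ≥ 2n + 1, both thresholds are assumed to lie between 1 and n, and `min_correct_processes` is a hypothetical helper. For integer thresholds the result is (2n + 1)/3 rounded up.

```python
def min_correct_processes(n: int) -> int:
    """Smallest max(n_ack, n_chk) over all thresholds satisfying both constraints."""
    best = None
    for n_ack in range(1, n + 1):
        if 2 * n_ack < n + 1:                 # n_ack >= (n + 1) / 2
            continue
        for n_chk in range(1, n + 1):
            if 2 * n_ack + n_chk >= 2 * n + 1:
                candidate = max(n_ack, n_chk)
                if best is None or candidate < best:
                    best = candidate
    return best
```

For n = 4 the function returns 3 (with n_ack = n_chk = 3), and for n = 7 it returns 5.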

³ Property (c) does not follow from (a) and (b). Take for example two messages m
and m' that conflict, but are not g-Delivered in stage k without the cost of Consensus:
neither property (a), nor property (b) applies.

4.2 The Generic Broadcast Algorithm

Provided that the number of correct processes is at least $\max(n_{ack}, n_{chk})$,
$n_{ack} \geq (n + 1)/2$, and $2 n_{ack} + n_{chk} \geq 2n + 1$, Algorithm 1 solves Generic Broadcast for
any conflict relation $\mathcal{C}$. All tasks in Algorithm 1 execute concurrently, and Task 3
has two entry points (lines 12 and 31). Process p in stage k manages the following
sets.

- $\mathit{R\text{-}delivered}_p$: contains all messages R-delivered by p up to the current time,

- $\mathit{G\text{-}delivered}_p$: contains all messages g-Delivered by p in all stages k' < k,

- $\mathit{pending}_p^k$: contains every message m such that p has sent an ACK message
for m in stage k up to the current time, and

- $\mathit{localNCg\text{-}Deliver}_p^k$: is the set of non-conflicting messages that are
g-Delivered by p in stage k, up to the current time (and before p executes
the k-th Consensus).

When p wants to g-Broadcast message m, p executes R-broadcast(m) (line 8).
After R-delivering a message m, the actions taken by p depend on whether m
conflicts or not with some other message m' in $\mathit{R\text{-}delivered}_p \setminus \mathit{G\text{-}delivered}_p$.

No conflict. If no conflict exists, then p includes m in $\mathit{pending}_p^k$ (line 14), and
sends an ACK message to all processes, acknowledging the R-delivery of m
(line 15). Once p receives $n_{ack}$ ACK messages for a message m (line 31), p
includes m in $\mathit{localNCg\text{-}Deliver}_p^k$ (line 35) and g-Delivers m (line 36).

Conflict. In case of conflict, p starts the terminating procedure for stage k.
Process p first sends a message of the type $(k, \mathit{pending}_p^k, \mathit{CHK})$ to all processes
(line 17), and waits for the same information from exactly $n_{chk}$ processes (line 18).
Then p builds the set $\mathit{majMSet}_p^k$ (line 20).^ It can be proved that $\mathit{majMSet}_p^k$
contains every message m that belongs to $\mathit{localNCg\text{-}Deliver}_q^k$ for some process q.
Then p starts Consensus (line 21) to decide on a pair $(\mathit{NCmsgSet}^k, \mathit{CmsgSet}^k)$
(line 22). Once the decision is made, process p first g-Delivers (in any order)
the messages in $\mathit{NCmsgSet}^k$ that it has not g-Delivered yet (lines 23 and 25),
and then p g-Delivers (in some deterministic order) the messages in $\mathit{CmsgSet}^k$
that it has not g-Delivered yet (lines 24 and 26). After g-Delivering all messages
decided in Consensus execution k, p starts stage k + 1 (lines 28-30).
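
The two threshold sets used in this description, the set of sufficiently acknowledged messages (line 33 of Algorithm 1) and the majority set (line 20), can be sketched as follows. This is only an illustration under stated assumptions: the received messages are modelled as a dictionary from a sender identifier to that sender's pending set, and the names `count_senders`, `ack_mset` and `maj_mset` are hypothetical helpers, not part of the algorithm.

```python
from collections import Counter
from math import ceil
from typing import Dict, Hashable, Set


def count_senders(pendings: Dict[str, Set[Hashable]]) -> Counter:
    """For each message, count how many senders reported it in their pending set."""
    counts: Counter = Counter()
    for pending in pendings.values():
        counts.update(pending)  # each sender contributes at most 1 per message
    return counts


def ack_mset(acks: Dict[str, Set[Hashable]], n_ack: int) -> Set[Hashable]:
    """Messages acknowledged by at least n_ack processes (cf. line 33)."""
    return {m for m, c in count_senders(acks).items() if c >= n_ack}


def maj_mset(chks: Dict[str, Set[Hashable]], n_chk: int) -> Set[Hashable]:
    """Messages found in at least ceil((n_chk + 1) / 2) CHK messages (cf. line 20)."""
    threshold = ceil((n_chk + 1) / 2)
    return {m for m, c in count_senders(chks).items() if c >= threshold}


# Hypothetical stage with n_chk = 3: only message "a" reaches the majority threshold of 2.
chks = {"q1": {"a", "b"}, "q2": {"a"}, "q3": {"c"}}
print(maj_mset(chks, n_chk=3))  # {'a'}
```

The sketch covers only the counting; sending, waiting and Consensus are deliberately left out.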

4.3 Proof of Correctness 

Due to space limitations, we have only included some of the proofs in this sec- 
tion. All proofs (Agreement, Partial Order, Validity, and Integrity) can be found 
in [9]. In the following, we prove that the three properties ((a)-(c)) presented in 
Section 4.1 hold. 

Lemma 1 states that the set $\mathit{pending}_p^k$ does not contain conflicting messages.
It is used to prove Lemmata 2 and 5 below.



^ $\mathit{majMSet}_p^k = \{m : |\mathit{Chk}_p^k(m)| \ge \lceil (n_{chk} + 1)/2 \rceil\}$
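Algorithm 1 Generic Broadcast

1: Initialisation:
2:    $\mathit{R\text{-}delivered} \leftarrow \emptyset$
3:    $\mathit{G\text{-}delivered} \leftarrow \emptyset$
4:    $k \leftarrow 1$
5:    $\mathit{pending}^1 \leftarrow \emptyset$
6:    $\mathit{localNCg\text{-}Deliver}^1 \leftarrow \emptyset$
7: To execute g-Broadcast(m):   {Task 1}
8:    R-broadcast(m)
9: g-Deliver(-) occurs as follows:
10:   when R-deliver(m)   {Task 2}
11:      $\mathit{R\text{-}delivered} \leftarrow \mathit{R\text{-}delivered} \cup \{m\}$
12:   when $(\mathit{R\text{-}delivered} \setminus \mathit{G\text{-}delivered}) \setminus \mathit{pending}^k \neq \emptyset$   {Task 3}
13:      if [ for all $m, m' \in \mathit{R\text{-}delivered} \setminus \mathit{G\text{-}delivered}$, $m \neq m'$ : $(m, m') \notin \mathit{Conflict}$ ] then
14:         $\mathit{pending}^k \leftarrow \mathit{R\text{-}delivered} \setminus \mathit{G\text{-}delivered}$
15:         send $(k, \mathit{pending}^k, \mathit{ACK})$ to all
16:      else
17:         send $(k, \mathit{pending}^k, \mathit{CHK})$ to all
18:         wait until [ for $n_{chk}$ processes q : p received $(k, \mathit{pending}_q^k, \mathit{CHK})$ from q ]
19:         #Define $\mathit{Chk}^k(m) = \{q \mid p \text{ received } (k, \mathit{pending}_q^k, \mathit{CHK}) \text{ from } q \text{ and } m \in \mathit{pending}_q^k\}$
20:         $\mathit{majMSet}^k \leftarrow \{m : |\mathit{Chk}^k(m)| \ge \lceil (n_{chk} + 1)/2 \rceil\}$
21:         propose$(k, (\mathit{majMSet}^k, (\mathit{R\text{-}delivered} \setminus \mathit{G\text{-}delivered}) \setminus \mathit{majMSet}^k))$
22:         wait until decide$(k, (\mathit{NCmsgSet}^k, \mathit{CmsgSet}^k))$
23:         $\mathit{NCg\text{-}Deliver}^k \leftarrow (\mathit{NCmsgSet}^k \setminus \mathit{localNCg\text{-}Deliver}^k) \setminus \mathit{G\text{-}delivered}$
24:         $\mathit{Cg\text{-}Deliver}^k \leftarrow \mathit{CmsgSet}^k \setminus \mathit{G\text{-}delivered}$
25:         g-Deliver messages in $\mathit{NCg\text{-}Deliver}^k$ in any order
26:         g-Deliver messages in $\mathit{Cg\text{-}Deliver}^k$ using some deterministic order
27:         $\mathit{G\text{-}delivered} \leftarrow (\mathit{localNCg\text{-}Deliver}^k \cup \mathit{NCg\text{-}Deliver}^k \cup \mathit{Cg\text{-}Deliver}^k) \cup \mathit{G\text{-}delivered}$
28:         $k \leftarrow k + 1$
29:         $\mathit{pending}^k \leftarrow \emptyset$
30:         $\mathit{localNCg\text{-}Deliver}^k \leftarrow \emptyset$
31:   when receive $(k, \mathit{pending}_q^k, \mathit{ACK})$ from q
32:      #Define $\mathit{Ack}^k(m) = \{q \mid p \text{ received } (k, \mathit{pending}_q^k, \mathit{ACK}) \text{ from } q \text{ and } m \in \mathit{pending}_q^k\}$
33:      $\mathit{ackMSet}^k \leftarrow \{m : |\mathit{Ack}^k(m)| \ge n_{ack}\}$
34:      $\mathit{localNCmsgSet}^k \leftarrow \mathit{ackMSet}^k \setminus (\mathit{G\text{-}delivered} \cup \mathit{NCmsgSet}^k)$
35:      $\mathit{localNCg\text{-}Deliver}^k \leftarrow \mathit{localNCg\text{-}Deliver}^k \cup \mathit{localNCmsgSet}^k$
36:      g-Deliver all messages in $\mathit{localNCmsgSet}^k$ in any order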
Lemma 1. For any process p, and all $k \ge 1$, if messages m and m' are in
$\mathit{pending}_p^k$, then m and m' do not conflict.

Proof: Suppose, by way of contradiction, that there is a process p, and some
$k \ge 1$ such that m and m' conflict and are in $\mathit{pending}_p^k$. Since m and m' are in
$\mathit{pending}_p^k$, p must have R-delivered m and m'. Assume that p first R-delivers m
and then m'. Thus, there is a time t after p R-delivers m' such that p evaluates
the if statement at line 13, and $m' \in \mathit{R\text{-}delivered}_p$, $m' \notin \mathit{G\text{-}delivered}_p$, and
$m' \notin \mathit{pending}_p^k$. At time t, $m \in \mathit{R\text{-}delivered}_p$ (by the hypothesis m is R-delivered
before m'), and $m \notin \mathit{G\text{-}delivered}_p$ (if $m \in \mathit{G\text{-}delivered}_p$, then from lines 27-29 m
and m' cannot both be in $\mathit{pending}_p^k$). Therefore, when the if statement at line 13
is evaluated, m and m' are in $\mathit{R\text{-}delivered}_p \setminus \mathit{G\text{-}delivered}_p$ and, since m and m'
conflict, the condition evaluates to false, and m' is not included in $\mathit{pending}_p^k$, a
contradiction that concludes the proof. □

Lemma 2 proves property (a). 

Lemma 2. If two messages m and m' conflict, then at most one of them is
g-Delivered in stage k before Consensus.

Proof: The proof is by contradiction. Assume that there are two messages m
and m' that conflict and are g-Delivered in stage k before Consensus. Without
loss of generality, consider that m is g-Delivered by process p, and m' is
g-Delivered by process q. From the Generic Broadcast algorithm (lines 31-36), p
(q) has received $n_{ack}$ messages of the type $(k, \mathit{pending}_r^k, \mathit{ACK})$ such that $m \in
\mathit{pending}_r^k$ ($m' \in \mathit{pending}_r^k$). Since $n_{ack} \ge (n + 1)/2$, there must be a process r
that sends the message $(k, \mathit{pending}_r^k, \mathit{ACK})$ to processes p and q, such that m
and m' are in $\mathit{pending}_r^k$, contradicting Lemma 1. □
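
The counting step behind the existence of r can be spelled out as follows. The names $A_p$ and $A_q$ are introduced here only for illustration: they denote the sets of processes whose ACK messages were counted by p (for m) and by q (for m'), and the threshold condition is read as $n_{ack} \ge (n+1)/2$.

```latex
\begin{align*}
|A_p \cap A_q| &= |A_p| + |A_q| - |A_p \cup A_q| \\
               &\ge n_{ack} + n_{ack} - n \\
               &\ge (n + 1) - n \;=\; 1 .
\end{align*}
```

Hence at least one process belongs to both sets, which is the process r used in the proof.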

Lemma 3 relates (1) the set $\mathit{Ack}^k(m)$ of processes that send an acknowledgement
for some message m in stage k and (2) the set $\mathit{Chk}_p^k$ of processes from which
some process p receives CHK messages in stage k, with (3) the set $\mathit{Chk}_p^k(m)$ of
processes from which p receives a CHK message containing m in stage k.

Lemma 3. Let $\mathit{Ack}^k(m)$ be a set of processes that execute the statement
send $(k, \mathit{pending}_q^k, \mathit{ACK})$ (line 15) in stage k with $m \in \mathit{pending}_q^k$, and let $\mathit{Chk}_p^k$
be the set of processes from which some process p receives messages of the type
$(k, \mathit{pending}_q^k, \mathit{CHK})$ in stage k (line 18). If $|\mathit{Ack}^k(m)| \ge n_{ack}$, $|\mathit{Chk}_p^k| \ge n_{chk}$,
and $2n_{ack} + n_{chk} \ge 2n + 1$, then there are at least $\lceil (n_{chk} + 1)/2 \rceil$ processes in
$\mathit{Chk}_p^k(m) = \mathit{Chk}_p^k \cap \mathit{Ack}^k(m)$.

Proof: We prove the contrapositive, that is, if $|\mathit{Chk}_p^k(m)| < \lceil (n_{chk} + 1)/2 \rceil$ then
$|\mathit{Ack}^k(m)| < n_{ack}$. From the definitions of $\mathit{Ack}^k(m)$ and
$\mathit{Chk}_p^k(m)$, it follows that $|\mathit{Ack}^k(m)| \le (n - n_{chk}) + |\mathit{Chk}_p^k(m)|$ (1). To see why,
notice that set $\mathit{Chk}_p^k(m)$ contains all processes from set $\mathit{Chk}_p^k$ that sent an
acknowledgement message for m. Process p does not know anything about the
remaining processes in $\Pi \setminus \mathit{Chk}_p^k$, but even if all of them acknowledged
message m, the number of acknowledgements is at most equal to $(n - n_{chk})$.

From (1) and the fact that $|\mathit{Chk}_p^k(m)| < \lceil (n_{chk} + 1)/2 \rceil$, we have
$|\mathit{Ack}^k(m)| - (n - n_{chk}) \le |\mathit{Chk}_p^k(m)| < \lceil (n_{chk} + 1)/2 \rceil$. Thus,
$(n - n_{chk}) > |\mathit{Ack}^k(m)| - \lceil (n_{chk} + 1)/2 \rceil$ (2). From $2n_{ack} + n_{chk} \ge 2n + 1$, and the fact that $n_{ack}$, $n_{chk}$, and n are
integers, we have that $(n - n_{chk}) \le n_{ack} - \lceil (n_{chk} + 1)/2 \rceil$ (3). Therefore, from
(2) and (3), we conclude that $|\mathit{Ack}^k(m)| < n_{ack}$. □
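
Inequality (3) is stated without its intermediate steps. One way to fill them in, assuming the threshold condition is read as $2n_{ack} + n_{chk} \ge 2n + 1$, is the following sketch.

```latex
\begin{align*}
2n_{ack} + n_{chk} \ge 2n + 1
  &\;\Longleftrightarrow\; 2(n - n_{chk}) \le 2n_{ack} - n_{chk} - 1 \\
  &\;\Longleftrightarrow\; n - n_{chk} \le n_{ack} - \frac{n_{chk} + 1}{2} .
\end{align*}
% Case n_chk odd: (n_chk + 1)/2 is an integer and equals its ceiling, so (3) follows directly.
% Case n_chk even: the right-hand side is a half-integer while the left-hand side is an
% integer, so the bound can be tightened by 1/2:
\begin{equation*}
n - n_{chk} \;\le\; n_{ack} - \frac{n_{chk} + 1}{2} - \frac{1}{2}
            \;=\; n_{ack} - \frac{n_{chk} + 2}{2}
            \;=\; n_{ack} - \left\lceil \frac{n_{chk} + 1}{2} \right\rceil .
\end{equation*}
```

In both cases $(n - n_{chk}) \le n_{ack} - \lceil (n_{chk} + 1)/2 \rceil$, which is where the integrality of $n_{ack}$, $n_{chk}$ and n is used.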

Lemma 4 proves property (b) presented in Section 4.1. It states that any
message g-Delivered by some process q during stage k, before q executes Consensus
in stage k, will be included in the set $\mathit{NCmsgSet}^k$ decided by Consensus k.

Lemma 4. For any two processes p and q, and all $k \ge 1$, if p executes
decide$(k, (\mathit{NCmsgSet}^k, -))$, then $\mathit{localNCg\text{-}Deliver}_q^k \subseteq \mathit{NCmsgSet}^k$.

Proof: Let m be a message in $\mathit{localNCg\text{-}Deliver}_q^k$. We first show that if p
executes the statement propose$(k, (\mathit{majMSet}_p^k, -))$, then $m \in \mathit{majMSet}_p^k$. Since
$m \in \mathit{localNCg\text{-}Deliver}_q^k$, q must have received $n_{ack}$ messages of the type
$(k, \mathit{pending}_r^k, \mathit{ACK})$ (line 31) such that $m \in \mathit{pending}_r^k$. Thus, there are $n_{ack}$
processes that sent m to all processes in the send statement at line 15. From
Lemma 3, $|\mathit{Chk}_p^k(m)| \ge \lceil (n_{chk} + 1)/2 \rceil$, and so, from the algorithm line 20,
$m \in \mathit{majMSet}_p^k$. Therefore, for every process p that executes
propose$(k, (\mathit{majMSet}_p^k, -))$, $m \in \mathit{majMSet}_p^k$. Let $(\mathit{NCmsgSet}^k, -)$ be the value
decided on Consensus execution k. By the uniform validity of Consensus, there is
a process r that executed propose$(k, (\mathit{majMSet}_r^k, -))$ such that $\mathit{NCmsgSet}^k =
\mathit{majMSet}_r^k$, and so, $m \in \mathit{NCmsgSet}^k$. □

Lemma 5 proves property (c). 

Lemma 5. If two messages m and m' conflict, then at most one of them is in
$\mathit{NCmsgSet}^k$.

Proof: The proof is by contradiction. Assume that there are two messages m
and m' that conflict, and are both in $\mathit{NCmsgSet}^k$. From the validity property of
Consensus, there must be a process p that executes propose$(k, (\mathit{majMSet}_p^k, -))$,
such that $\mathit{NCmsgSet}^k = \mathit{majMSet}_p^k$. Therefore, m and m' are in $\mathit{majMSet}_p^k$,
and from the algorithm, p receives $\lceil (n_{chk} + 1)/2 \rceil$ messages of the type
$(k, \mathit{pending}_q^k, \mathit{CHK})$ such that m is in $\mathit{pending}_q^k$, and p also receives
$\lceil (n_{chk} + 1)/2 \rceil$ messages of the type $(k, \mathit{pending}_q^k, \mathit{CHK})$ such that m' is in $\mathit{pending}_q^k$.
Since p waits for $n_{chk}$ messages of the type $(k, \mathit{pending}_q^k, \mathit{CHK})$, there must
exist at least one process q in $\mathit{Chk}_p^k$ such that m and m' are in $\mathit{pending}_q^k$,
contradicting Lemma 1. □
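
The existence of the common process q rests on a pigeonhole count. In the sketch below, $B_m$ and $B_{m'}$ are names introduced only for illustration: they denote the subsets of the $n_{chk}$ processes heard by p whose CHK messages contain m and m' respectively, each of size at least $\lceil (n_{chk} + 1)/2 \rceil$.

```latex
\begin{align*}
|B_m \cap B_{m'}| &\ge |B_m| + |B_{m'}| - n_{chk} \\
                  &\ge 2 \left\lceil \frac{n_{chk} + 1}{2} \right\rceil - n_{chk} \\
                  &\ge (n_{chk} + 1) - n_{chk} \;=\; 1 .
\end{align*}
```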



4.4 Strictness and Cost of the Generic Broadcast Algorithm 

Proposition 5 states that the Generic Broadcast algorithm of Section 4.2 is a 
strict implementation of Generic Broadcast. 
Proposition 5. Algorithm 1 is a strict Generic Broadcast algorithm.

We now discuss the cost of our Generic Broadcast algorithm. Our main result 
is that for messages that do not conflict, the Generic Broadcast algorithm can 
deliver messages with a delivery latency equal to 2, while for messages that con- 
flict, the delivery latency is at least equal to 4. Since known Atomic Broadcast 
algorithms deliver messages with a delivery latency of at least 3,^ this result
shows the tradeoff of the Generic Broadcast algorithm: if messages conflict fre- 
quently, our Generic Broadcast algorithm may become less efficient than an 
Atomic Broadcast algorithm, while if conflicts are rare, then our Generic Broad- 
cast algorithm leads to smaller costs compared to Atomic Broadcast algorithms. 

Propositions 6 and 7 assess the cost of the Generic Broadcast algorithm when 
messages do not conflict. In order to simplify the analysis of the delivery latency, 
we concentrate our results on runs with one message (although the results can 
be extended to more general runs). Proposition 6 defines a lower bound on the 
delivery latency of the algorithm, and Proposition 7 shows that this bound can 
be reached in runs where there are no process failures. We consider a particular 
implementation of Reliable Broadcast that appears in [2].^ 

Proposition 6. Assume that Algorithm 1 uses the Reliable Broadcast
implementation presented in [2]. If $\mathcal{R}_C$ is a set of runs generated by Algorithm 1 such
that m is the only message g-Broadcast and g-Delivered in runs in $\mathcal{R}_C$, then there
is no run R in $\mathcal{R}_C$ where $dl^R(m) < 2$.



Proposition 7. Assume that Algorithm 1 uses the Reliable Broadcast
implementation presented in [2]. If $\mathcal{R}_C$ is a set of runs generated by Algorithm 1, such
that in runs in $\mathcal{R}_C$, m is the only message g-Broadcast and g-Delivered, and there
are no process failures, then there is a run R in $\mathcal{R}_C$ where $dl^R(m) = 2$.

The results that follow define the behaviour of the Generic Broadcast algo- 
rithm in runs where conflicting messages are g-Broadcast. Proposition 8 estab- 
lishes a lower bound for cases where messages conflict, and Proposition 9 shows 
that the best case with conflicts can be reached when there are neither process
failures nor failure suspicions.

Proposition 8. Assume that Algorithm 1 uses the Reliable Broadcast
implementation presented in [2], and the Consensus implementation presented in [11].
Let $\mathcal{R}_C$ be a set of runs generated by Algorithm 1, such that m and m' are the
only messages g-Broadcast and g-Delivered in $\mathcal{R}_C$. If m and m' conflict, then
there is no run R in $\mathcal{R}_C$ where $dl^R(m) < 4$ and $dl^R(m') < 4$.

^ An exception is the Optimistic Atomic Broadcast algorithm [8], which can deliver 
messages with delivery latency equal to 2 if the spontaneous total order property 
holds. 

^ Whenever a process p wants to R-broadcast a message m, p sends m to all processes.
Once a process q receives m, if $q \neq p$ then q sends m to all processes, and q
R-delivers m.
Proposition 9. Assume that Algorithm 1 uses the Reliable Broadcast
implementation presented in [2], and the Consensus implementation presented in [11].
Let $\mathcal{R}_C$ be a set of runs generated by Algorithm 1, such that m and m' are the
only messages g-Broadcast and g-Delivered in $\mathcal{R}_C$, and there are neither process
failures nor failure suspicions. If m and m' conflict, then there is a run R in $\mathcal{R}_C$
where m is g-Delivered before m' and $dl^R(m) = 2$ and $dl^R(m') = 4$.



5 Related Work 

Group communication aims at extending traditional one-to-one communication,
which is insufficient in many settings. One-to-many communication is typically 
needed to handle replication (replicated data, replicated objects, etc.). Classical 
techniques to manage replicated data are based on voting and quorum systems 
(e.g., [3,5,6] to cite a few). Early quorum systems distinguish read operations 
from write operations in order to allow for concurrent read operations. These 
ideas have been extended to abstract data types in [5]. Increasing concurrency,
without compromising the strong consistency guarantees on replicated data, is 
a standard way to increase the performance of the system. Lazy replication [10] 
is another approach that aims at increasing the performance by reducing the 
cost of replication. Lazy replication also distinguishes between read and write 
operations, and relaxes the requirement of total order delivery of read operations. 
Consistency is ensured at the cost of managing timestamps outside of the set of 
replicated servers; these timestamps are used to ensure Causal Order delivery 
on the replicated servers. 

Our approach also aims at increasing the performance of replication by in- 
creasing concurrency in the context of group communication. Similarly to quo- 
rum systems, our Generic Broadcast algorithm allows for concurrency that is 
not possible with traditional replication techniques based on Atomic Broadcast. 
From this perspective, our work can be seen as a way to integrate group com- 
munications and quorum systems. There is even a stronger similarity between 
quorum systems and our Generic Broadcast algorithm. Our algorithm is based 
on two sets: an acknowledgement set and a checking set.^ These sets play a role 
similar to quorum systems. However, quorum systems require weaker conditions 
to keep consistency than the condition required by the acknowledgement and 
checking sets.^ Although the reason for this discrepancy is very probably re- 
lated to the guarantees offered by quorum systems, the question requires further 
investigation. 



^ Used respectively for g-Delivering non-conflicting messages during a stage, and de- 
termining non-conflicting messages g-Delivered at the termination of a stage. 

^ Let $n_r$ be the size of a read quorum, and $n_w$ the size of a write quorum. Quorum
systems usually require that $n_r + n_w \ge n + 1$.
6 Conclusions

The paper has introduced the Generic Broadcast problem, which is defined based 
on a conflict relation on the set of messages. The notion of conflict can be derived 
from the semantics of the messages. Only conflicting messages have to be delivered
by all processes in the same order. As such, Generic Broadcast is a powerful 
message ordering abstraction, which includes Reliable and Atomic Broadcast as 
special cases. The advantage of Generic Broadcast over Atomic Broadcast is a 
cost issue, where cost is defined by the notion of delivery latency of messages. 
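
The two special cases can be made concrete by writing the conflict relation as a predicate on pairs of messages. The sketch below is illustrative only: the message encoding (strings prefixed with "read" or "write"), the read/write relation and all function names are assumptions introduced here, not definitions taken from the paper.

```python
from typing import Callable

Message = str
Conflict = Callable[[Message, Message], bool]


def reliable_conflict(m1: Message, m2: Message) -> bool:
    """Empty conflict relation: no ordering is required (Reliable Broadcast)."""
    return False


def atomic_conflict(m1: Message, m2: Message) -> bool:
    """Every pair conflicts: all messages are totally ordered (Atomic Broadcast)."""
    return True


def read_write_conflict(m1: Message, m2: Message) -> bool:
    """Hypothetical semantic relation: only pairs involving a write conflict."""
    return m1.startswith("write") or m2.startswith("write")


def must_be_ordered(conflict: Conflict, m1: Message, m2: Message) -> bool:
    """Two distinct messages need a common delivery order only if they conflict."""
    return m1 != m2 and conflict(m1, m2)


print(must_be_ordered(read_write_conflict, "read x", "read y"))   # False
print(must_be_ordered(read_write_conflict, "read x", "write x"))  # True
```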

On a different issue, our Generic Broadcast algorithm uses mechanisms that 
have similarities with quorum systems. As future work it would be interesting 
to investigate this point to better understand the differences between replication
protocols based on group communication (e.g., Atomic Broadcast, Generic
Broadcast) and replication protocols based on quorum systems. 

Finally, as noted in Section 4.1, our Generic Broadcast algorithm requires at 
least (2n + l)/3 correct processes. Such a condition is usual in the context of 
Byzantine failures, but rather surprising in the context of crash failures. 
Abstract. Quorum systems have been used to implement many coordination
problems in distributed systems. In this paper, we propose a
reformulation of the definition of Byzantine quorum systems. Our refor- 
mulation captures the requirement for non-blocking access to quorums 
in asynchronous systems. We formally define the asynchronous access 
cost of quorum systems and we show that the asynchronous access cost 
and not the size of a quorum is the right measure of message complex- 
ity of protocols using quorums in asynchronous systems. We also show 
that previous quorum systems proposed in the literature have a very 
high asynchronous access cost. We present new quorum systems with 
low asynchronous access cost and whose other performance parameters 
match those of the best Byzantine quorum systems proposed in the lit- 
erature. In particular, we present a construction for the disjoint failure 
pattern that outperforms previously proposed systems for that pattern. 

Keywords: quorum, tolerance, Byzantine, failures, distributed, asyn- 
chronous, access cost. 



1 Introduction 

A quorum system is a collection of sets (quorums) that mutually intersect. Quo- 
rum systems have been used to implement mutual exclusion [1,8], replicated data 
systems [7], commit protocols [15], and distributed consensus [11]. For example, 
in a typical implementation of mutual exclusion using a quorum system, pro- 
cessors request access to the critical section from all members of a quorum. A 
processor can enter its critical section only if it receives permission from all pro- 
cessors in a quorum.^ Work on quorum systems traditionally considered crash 
failures [1,2,4,5,6,7,8,14,13]. Malkhi and Reiter [9] proposed the interesting no- 
tion of Byzantine quorums - quorum systems that can tolerate Byzantine fail- 
ures. They showed that the traditional definition of quorums is not adequate 
to handle Byzantine failures: in the presence of Byzantine failures, the intersec- 
tion of two quorums should contain enough correct processors so that correct 

^ Additional measures are needed to ensure that the implementation is fair and
deadlock-free.
processors can unambiguously access the quorums. They presented protocols to
implement a distributed shared register variable using Byzantine quorums. Their 
implementation requires a client accessing a quorum to wait for responses from 
every processor in a quorum set, but they did not study the problem of finding 
a quorum set whose elements are available - an available quorum. 

In this paper, we study the cost of finding an available quorum in the presence 
of Byzantine failures. We introduce non-blocking Byzantine quorum systems and 
show that they can be achieved at a low cost and we present non-blocking Byzan- 
tine quorum constructions for two failure models. The constructions we present 
are the first that do not require blocking and that have a low cost. Also, the 
construction we present for the Disjoint failure model yields a Byzantine quo- 
rum system that has better performance parameters than previously proposed 
systems. Our constructions rely on a new access model we call partial access.
With partial access, a processor need not wait for a reply from each process in 
a quorum set. The quorum system should be designed to ensure that any two 
partial accesses have a large enough intersection to ensure consistency. It turns 
out that the set of partial accesses of a non-blocking Byzantine quorum system 
is a Byzantine quorum system as defined in [9].

It should be emphasized that partial access is relevant for general quorum 
access in asynchronous systems and is not only relevant to asynchronous Byzan- 
tine quorum systems. In this paper, we only consider partial access for the case 
of Byzantine failures. The same methods we propose are applicable to cases in 
which the failures are not Byzantine, but that are restricted to a predefined fail- 
ure pattern. A probabilistic model of failure does not have a predefined failure 
pattern and therefore our results are not directly applicable to it. 

The rest of the paper is organized as follows. Section 2 discusses related work 
and Section 3 summarizes our contributions. Section 4 presents basic definitions 
and introduces the notion of asynchronous access cost. Section 5 gives examples 
of the asynchronous access cost of two Byzantine quorum systems. Section 6 
reformulates the definition of Byzantine quorums to capture the asynchronous 
access cost as a design objective. Section 7 presents non-blocking quorum systems 
with low asynchronous access cost and whose other performance parameters 
match those of the best Byzantine quorum systems proposed in the literature. 
Section 8 concludes the paper. 

2 Related Work 

The problem of finding an available quorum has been addressed by researchers 
for the case of detectable crash failures [2,13]. In [13], the probe complexity of 
a quorum system is defined. The probe complexity is the minimum number of 
processors that need to be contacted to establish the existence or non-existence 
of an available quorum. In the definition of probe complexity, processors can 
be probed incrementally and the identity of the processor to be probed next 
can depend on the responses received from previous probes. In [2], the author 
formally defined the concept of cost of failures, which can be thought of as
the probe complexity per failure. Both [2] and [13] assume that failures can
be detected. Their incremental access methods are not directly applicable to 
asynchronous systems subject to Byzantine failures. 

The problem of finding an available quorum in the presence of Byzantine 
failures has not been studied by other researchers. Due to the nature of Byzantine 
failures and system asynchrony, the definition of Byzantine quorums proposed 
in [9] requires that an available quorum exists in the system. Unfortunately, that 
requirement does not say anything about the cost of finding an available quorum. 
The availability requirement of Byzantine quorum systems was relaxed in [3] for 
systems in which timeouts can be used to detect failures. In such systems, any 
quorum set Q can be accessed without a need to access servers that do not 
belong to Q. 

3 Contributions 

To our knowledge, this is the first work that studies the cost of accessing Byzan- 
tine quorum systems in asynchronous systems. We define the asynchronous ac- 
cess cost of a Byzantine quorum and introduce non-blocking Byzantine quorum 
systems. Unlike Byzantine quorum systems, non-blocking Byzantine quorum systems
capture the asynchronous access cost as well as the probe complexity. In that
respect, they are similar to synchronous Byzantine quorum systems [3].

We propose optimal non-blocking quorum systems and show that they are not 
equivalent to previously proposed Byzantine quorum systems. For the disjoint 
failure pattern, we propose a non-blocking quorum system that yields the best 
known Byzantine quorum system for that failure model. 

4 Definitions and System Model 

4.1 System Model 

We assume that the system consists of a set V of n server processors and a 
number of client processors that are distinct from the servers. All processors 
can communicate using reliable message passing. We assume that there are no 
bounds on message delivery time or on processor speeds and that there are no
failure detectors in the system. 



4.2 Failure Model 

Server processors can fail.^ The assumptions about failures affect the way a 
quorum can be used. In the Byzantine failure model, it is usually assumed that 
there is a known bound on the number of failures that can occur in the system 
and that failed processors do not recover. In this paper we adopt a model of 
failures introduced in [9]. The set faulty denotes the set of faulty processors in

^ We do not consider client failures in this paper. 
the system. A failure pattern $\mathcal{F}$ identifies the possible sets of faulty processors in
the system. We write $\mathcal{F} = \{F_1, F_2, \ldots, F_m\}$. There exists an element F of $\mathcal{F}$ such
that at any given instant, the faulty processors belong to F. The processors do
not necessarily know F. A common example of a failure pattern is the f-threshold
pattern in which $\mathcal{F} = \{F \subseteq V : |F| = f\}$. Another interesting failure pattern
is the disjoint pattern in which all elements of $\mathcal{F}$ are disjoint [9].
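
Both kinds of pattern are easy to build and test on small systems. The following sketch assumes the f-threshold pattern is the family of all server subsets of size exactly f; the function names `threshold_pattern` and `is_disjoint_pattern` are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List


def threshold_pattern(servers: Iterable[int], f: int) -> List[FrozenSet[int]]:
    """All subsets of the servers of size exactly f (assumed reading of f-threshold)."""
    return [frozenset(c) for c in combinations(sorted(servers), f)]


def is_disjoint_pattern(pattern: Iterable[FrozenSet[int]]) -> bool:
    """True if every two distinct elements of the pattern are disjoint."""
    return all(a.isdisjoint(b) for a, b in combinations(list(pattern), 2))


print(len(threshold_pattern(range(4), 2)))                  # 6
print(is_disjoint_pattern(threshold_pattern(range(4), 1)))  # True
print(is_disjoint_pattern(threshold_pattern(range(4), 2)))  # False
```

With f = 1 the threshold pattern consists of singletons and is therefore also a disjoint pattern; for larger f the two notions differ.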



4.3 Quorum Systems 

The standard definition of a quorum system is the following. 

Definition 1. A quorum system $\mathcal{Q}$ over V is a set of subsets (called quorums)
of V such that any two quorums have a non-empty intersection.

The intersection property of quorums is essential for their use in coordination 
problems. 
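
A minimal sketch of the intersection check in Definition 1 follows. The function name `is_quorum_system` and the majority example are illustrative assumptions.

```python
from itertools import combinations
from typing import Iterable, Set


def is_quorum_system(quorums: Iterable[Set[int]]) -> bool:
    """True if every quorum is non-empty and any two quorums intersect."""
    qs = [frozenset(q) for q in quorums]
    # A quorum must intersect itself, so it cannot be empty.
    if any(len(q) == 0 for q in qs):
        return False
    return all(a & b for a, b in combinations(qs, 2))


# Majorities over five servers: any two subsets of size 3 share at least one server.
majorities = [set(c) for c in combinations(range(5), 3)]
print(is_quorum_system(majorities))        # True
print(is_quorum_system([{0, 1}, {2, 3}]))  # False
```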

Processors access a quorum to coordinate their actions. The following is the 
typical way to access a quorum. A processor, the client, sends a request to every
processor, the server, in a quorum set. Upon receiving a request, a correct processor
updates its state and sends a reply. The client waits until it receives a reply from 
every processor in the quorum. If the client receives replies from all processors 
in a quorum then the access is considered successful. If one of the servers failed, 
then the client attempts to access another quorum that does not have any faulty 
processor (the question of finding a quorum with no faulty processors has been 
addressed in [2,13]). Since processors access a quorum only if all its members 
are correct, two clients are always guaranteed to receive a response from a com- 
mon correct server which belongs to a non-empty intersection of two quorums. 
The correctness of quorum-based protocols relies on this intersection property. A
quorum is said to be available if all its elements are correct processors [12]. 



4.4 Byzantine Quorums in Asynchronous Systems 

If failures are arbitrary, a processor might receive conflicting replies from faulty 
and correct processors. It follows that a processor must base its coordination 
decisions on replies that it knows to be from correct processors. Motivated by 
this requirement, Malkhi and Reiter gave the following definition [9]: 

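Definition 2. A quorum system tolerates failure pattern $\mathcal{F}$ if

1. $\forall Q_1, Q_2 \in \mathcal{Q}\ \forall F_1, F_2 \in \mathcal{F} : (Q_1 \cap Q_2) - F_1 \not\subseteq F_2$.
2. $\forall F \in \mathcal{F}\ \exists Q \in \mathcal{Q} : F \cap Q = \emptyset$.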

The first condition requires that the intersection of two quorums is not contained
in the union of two sets in $\mathcal{F}$. This guarantees that the replies of correct
processors can be identified. To see why this is the case, consider, as an example,
the f-threshold failure pattern. For that pattern, the condition reduces to a
familiar-looking condition: $|Q_1 \cap Q_2| \ge 2f + 1$. So, it is always possible to have $f + 1$
identical replies in the intersection of two quorums. One of these replies must be
the reply of a correct processor.

The second condition requires that some quorum consists of correct proces- 
sors. The second condition is needed in asynchronous systems because there is no 
way to differentiate a slow processor from a faulty one. To access a quorum in an 
asynchronous system, a processor cannot simply send requests to all processors 
in a quorum set and wait for replies. In the worst case, even in a failure-free
execution, a processor might have to send requests to every processor in the
system and then wait for replies from a quorum that consists of correct processors
(we give an example below). The availability condition (also called resiliency 
requirement in [10]) is needed to ensure that some quorum is available in the 
system. The work of Malkhi and Reiter [9] and their subsequent work [10] does 
not address the problem of ensuring that a response is received. Addressing this 
problem is an important contribution of this paper. 
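
Both conditions of Definition 2 can be checked by brute force on small systems. The sketch below assumes the first condition reads "$(Q_1 \cap Q_2) - F_1$ is not a subset of $F_2$" and the second "some quorum is disjoint from F"; the function name `tolerates` and the example systems are hypothetical.

```python
from itertools import combinations
from typing import Iterable, Set


def tolerates(quorums: Iterable[Set[int]], pattern: Iterable[Set[int]]) -> bool:
    """Brute-force check of the two conditions of Definition 2 (assumed reading)."""
    qs = [frozenset(q) for q in quorums]
    fs = [frozenset(f) for f in pattern]
    # Condition 1: no intersection of two quorums is covered by two faulty sets.
    consistent = all(
        not ((q1 & q2) - f1 <= f2)
        for q1 in qs for q2 in qs for f1 in fs for f2 in fs
    )
    # Condition 2: for every faulty set, some quorum avoids it entirely.
    available = all(any(f.isdisjoint(q) for q in qs) for f in fs)
    return consistent and available


# 1-threshold pattern: every single server may be the faulty one.
five = [set(c) for c in combinations(range(5), 4)]  # quorums of size 4 out of 5
four = [set(c) for c in combinations(range(4), 3)]  # quorums of size 3 out of 4
print(tolerates(five, [{s} for s in range(5)]))  # True: intersections have >= 3 servers
print(tolerates(four, [{s} for s in range(4)]))  # False: intersections have only 2
```

The second example fails the first condition because an intersection of size 2 is smaller than $2f + 1 = 3$.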

4.5 Cost of Access 

In this section we introduce the asynchronous access cost of a quorum system. 
In asynchronous systems, a processor cannot use timeouts to detect failures. As 
pointed out above, to access a quorum set, a processor cannot simply send requests
to every element of the quorum set and then wait for replies. This is due to sys- 
tem asynchrony and the fact that failed processes might never reply. Byzantine 
quorum systems as defined in [9] get around this difficulty by requiring that for 
every faulty set F there exists a quorum set Q such that $F \cap Q = \emptyset$. It follows
that one way to guarantee replies from some quorum is to send requests to every 
processor in the system and then wait for replies from some quorum. Obviously, 
sending requests to every processor is too costly and eliminates the benefits of 
using quorum systems. 

In practice, a system might not be fully asynchronous and quorums can be 
accessed without the need to contact every processor in the system. For example, 
a client can send requests to a set of servers in a quorum and then send requests
to more servers if no response is received within some time limit, until a response
is received from a quorum set. In this paper, we are considering the question of 
accessing quorums in systems that are fully asynchronous in which there are no 
bounds on message delivery delays or processor speeds. For such systems, our
aim is to improve on the worst-case scenario in which every server needs to be 
contacted to guarantee a response from some quorum set. 

In this paper, we consider the direct access model in which processes access 
a quorum by sending all requests at once and then wait for replies. The direct 
access model is not the most general model of accessing quorums. For instance, 
a process might incrementally access a quorum system by sending requests to 
some processes, then send further requests based on the replies it receives. We call 
such an incremental strategy a fault-tolerant access strategy (see [2] for a formal 
definition, also see [13] for a related notion). We believe that an incremental 
fault-tolerant access strategy is more appropriate for systems with detectable 
failures than for asynchronous systems with undetectable failures. In fact, an 
incremental strategy would require more than one round of message exchange,
the cost of which can be prohibitively high in a fully asynchronous system. Studying the
cost of access of incremental strategies is a subject for future research.

Definition 3. Let $\mathcal{Q}$ be a Byzantine quorum system. A set A is an access set
of a quorum set $Q \in \mathcal{Q}$ if $Q \subseteq A$ and

$$\forall F \in \mathcal{F}\ \exists Q' \in \mathcal{Q} : Q' \subseteq A - F$$

Note that a quorum Q might have more than one access set. 

Definition 4. Let $\mathcal{Q}$ be a Byzantine quorum system. The asynchronous access
cost of a quorum set $Q \in \mathcal{Q}$ is the size of the smallest access set of Q.

Note that by the definition of a Byzantine quorum system, the set V of all 
servers is an access set of each quorum set, and therefore the asynchronous access 
cost is well defined. 

Definition 5. The asynchronous access cost of a Byzantine quorum system $\mathcal{Q}$
is

$$cost(\mathcal{Q}) = \min\{|A| : \forall F \in \mathcal{F}\ \exists Q \in \mathcal{Q} : Q \subseteq A - F\}$$

In the direct access model, the asynchronous access cost gives the minimum 
number of servers that need to be contacted to ensure that a response is received 
from some quorum set. 

As the following theorem shows, in a fully asynchronous system, a client
needs to send requests to $cost(\mathcal{Q})$ servers each time it needs to access a quorum
set.

Theorem 6. Let Q he a Byzantine quorum system. In the direet aeeess model, 
a elient needs to send requests to cost{Q) servers to guarantee a response from 
eaeh server in some quorum set. 

Proof. In the direct access model, a client $c$ sends all requests at the beginning to
some set of servers $A$. If $|A| < \mathrm{cost}(\mathcal{Q})$, then, by definition of the asynchronous
access cost, there is a faulty set $F \in \mathcal{F}$ such that $A - F$ contains no quorum set. If
the processes in $F$ fail, $c$ will not receive a response from every server in any quorum
set.
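The definition of the asynchronous access cost can be evaluated directly on small instances. The following is a minimal brute-force sketch, not part of the original construction: the function name `access_cost` and the toy instance (five servers, any single server may fail, every 4-element subset is a quorum) are assumptions chosen only for illustration, and the search is exponential in the number of servers.

```python
from itertools import combinations


def access_cost(servers, quorums, failure_sets):
    """Return the size of a smallest set A such that, for every failure set F,
    some quorum is contained in A - F (None if no such set exists)."""
    servers = sorted(servers)
    quorums = [frozenset(q) for q in quorums]
    failure_sets = [frozenset(f) for f in failure_sets]
    # Try candidate access sets in order of increasing size.
    for size in range(len(servers) + 1):
        for candidate in combinations(servers, size):
            a = frozenset(candidate)
            if all(any(q <= a - f for q in quorums) for f in failure_sets):
                return size
    return None


# Hypothetical toy instance: 5 servers, any single server may fail,
# and every 4-element subset is a quorum.
toy_servers = range(5)
toy_quorums = list(combinations(toy_servers, 4))
toy_failures = [frozenset([s]) for s in toy_servers]
toy_cost = access_cost(toy_servers, toy_quorums, toy_failures)
print(toy_cost)
```

In this toy instance every server must be contacted, because removing any one server from a smaller set leaves fewer than four servers.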



4.6 Strategies and Load 

This section presents the formal definitions of strategy and load as in [12]. It dis- 
cusses the implications of the asynchronous access cost on the load of a quorum 
system. 

A protocol using a quorum system chooses a quorum to access according to 
some rules. A strategy is a probabilistic rule to choose a quorum. Formally, a 
strategy is defined as follows. 
Definition 7. Let $\mathcal{Q} = \{Q_1, \ldots, Q_m\}$ be a quorum system. A strategy $w \in [0, 1]^m$
for $\mathcal{Q}$ is a probability distribution over $\mathcal{Q}$.

For every processor $q \in V$, a strategy $w$ induces a probability that $q$ is
chosen to be accessed. This probability is called the load on $q$. The system load
is the load of the busiest element induced by the best possible strategy.

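Definition 8. Let $w$ be a strategy for a quorum system $\mathcal{Q} = \{Q_1, \ldots, Q_m\}$. For
any $q \in V$, the load induced by $w$ on $q$ is $l_w(q) = \sum_{Q_j \ni q} w_j$. The load induced
by $w$ on $\mathcal{Q}$ is

$$\mathcal{L}_w(\mathcal{Q}) = \max_{q \in V} l_w(q).$$

The system load on $\mathcal{Q}$ is

$$\mathcal{L}(\mathcal{Q}) = \min_{w} \{\mathcal{L}_w(\mathcal{Q})\},$$

where the minimum is taken over all strategies.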
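As an illustration of the load induced by a fixed strategy, the sketch below sums, for each server, the probabilities of the quorums that contain it and reports the maximum. The function name `induced_loads` and the three-server example are assumptions made for this sketch. It does not compute the system load, which would require minimizing over all strategies (for instance with a linear program).

```python
from fractions import Fraction


def induced_loads(quorums, strategy):
    """Return ({server: load}, load of the busiest server) for a strategy,
    given as one probability per quorum."""
    if len(quorums) != len(strategy) or sum(strategy) != 1:
        raise ValueError("strategy must be a probability distribution over the quorums")
    loads = {}
    for quorum, weight in zip(quorums, strategy):
        for server in quorum:
            loads[server] = loads.get(server, 0) + weight
    return loads, max(loads.values())


# Hypothetical example: the three 2-element subsets of {0, 1, 2}, chosen uniformly.
example_quorums = [{0, 1}, {1, 2}, {0, 2}]
uniform = [Fraction(1, 3)] * 3
per_server, busiest = induced_loads(example_quorums, uniform)
print(per_server, busiest)
```

Exact fractions are used so that the check that the probabilities sum to one is not affected by floating-point rounding.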

The definition of load implicitly assumes that no extra servers need to be 
contacted when a particular quorum is accessed. This is not the case for Byzan- 
tine failures because extra servers need to be accessed even in failure-free runs to 
guarantee a response (assuming clients do not know that the run is failure-free).
From the discussion about the cost of access, it follows that the load definition 
should take the cost of access into consideration. The definition of load can sim- 
ply be changed by replacing a quorum with the access set of the quorum, while 
allowing for different access sets for the same quorum at different times. If the 
only access set of any quorum is the set V of all servers, it follows that the load 
is 1, regardless of the quorum size. 



5 Asynchronous Access Cost Examples 

In this section we give examples of the asynchronous access cost for two Byzantine
quorum systems. The first system, the Paths system, combines optimal quorum
size and load (as traditionally defined) with high availability in the
presence of crash failures. We show that it has a large asynchronous access cost.
The second system, the threshold system, has a small asynchronous access cost 
relative to the size of its quorums. 



5.1 Paths System 

The Paths system [10] is defined for the $f$-threshold failure pattern. It is defined
as follows. Let $n = d^2$ be the number of servers arranged in a $d \times d$ square grid of the
triangular lattice. A quorum consists of $\sqrt{2f+1}$ non-intersecting top-bottom
paths and $\sqrt{2f+1}$ non-intersecting left-right paths. In [10], it is shown that any
two quorums intersect in $2f + 1$ distinct vertices and that the Paths system can
tolerate no more than $\sqrt{n}$ failures.
The Paths system has small quorum size, small load (not taking access
cost into consideration) and high availability in the presence of crash failures.
Unfortunately, the asynchronous access cost of the Paths system is high, as we
show below.



Lemma 9. $\mathrm{cost}(\mathrm{Path}) = \Omega((f + \sqrt{f+1})\,d)$.

Proof. If a set $A$ is of size $|A| < (f + \sqrt{f+1})\,d$, then $A$ cannot contain more
than $(f + \sqrt{f+1}) - 1$ disjoint left-right paths because each left-right path is
of size at least $d$. It follows that a maximum cut set of $A$ is of size at most
$(f + \sqrt{f+1}) - 1$. Removing $f$ vertices from the maximum cut set yields a set
with cut set of size at most $\sqrt{f+1} - 1$. Such a set cannot have $\sqrt{f+1}$ disjoint
paths.

In particular, the access cost of the Paths system is very high if $f = \Omega(d)$.

Corollary 10. If $f = \Omega(d)$, then $\mathrm{cost}(\mathrm{Path}) = \Omega(n)$.



5.2 Threshold System 

The threshold system is defined for the $f$-threshold failure pattern. A quorum
of the threshold system consists of any set of $f + \lceil (n+1)/2 \rceil$ servers. Any two quorums are
guaranteed to intersect in at least $2f + 1$ elements. For the threshold system, the
cost of the system is not much different from the quorum size, but the quorum
size is large.

Lemma 11. The asynchronous access cost of the Threshold system is $\Theta(2f + \lceil (n+1)/2 \rceil)$.

Proof. In fact, a set $A$ of size $2f + \lceil (n+1)/2 \rceil$ vertices is guaranteed to contain a
quorum if any $f$ elements are removed from $A$. Also, a set $A$ of size less than
$2f + \lceil (n+1)/2 \rceil$ vertices is not guaranteed to contain a quorum if $f$ elements are
removed from $A$.
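The arithmetic behind the threshold system can be checked with a few lines of code. The sketch below assumes that a quorum has $f + \lceil (n+1)/2 \rceil$ servers and that an access set has $2f + \lceil (n+1)/2 \rceil$ servers, which is how the garbled formulas in this section are read here. The function name `threshold_parameters` is hypothetical.

```python
def threshold_parameters(n, f):
    """Return (quorum size, access set size, guaranteed intersection size)
    for the threshold system, under the assumed formulas stated above."""
    if f < 0 or n < 4 * f + 1:
        raise ValueError("the threshold system needs n >= 4f + 1")
    half = (n + 2) // 2                     # equals ceil((n + 1) / 2) for integer n
    quorum_size = f + half
    access_size = 2 * f + half              # removing any f servers leaves a quorum
    min_intersection = 2 * quorum_size - n  # two quorums overlap in at least this many servers
    return quorum_size, access_size, min_intersection


for n, f in [(5, 1), (9, 2), (10, 2)]:
    print(n, f, threshold_parameters(n, f))
```

Under these assumptions the access set is only $f$ servers larger than a quorum, which is the sense in which the cost is close to the quorum size.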

6 Non-blocking Quorum Systems 

The goal of defining non-blocking quorum systems is to emphasize the impor- 
tance of the asynchronous access cost as a design parameter, as well as to provide 
a more uniform definition of Byzantine quorum systems. In the examples above, 
there is no clear relationship between the cost of access and the size of the quo- 
rum. Our aim is to reformulate the definition of the quorum system so that the 
cost of access is found as part of the design of the quorum system and not after 
the quorum is already designed. So, instead of designing the quorum sets, we 
directly design the access sets. We should note here that access sets have not 
been considered previously by other researchers. 

We define a non-blocking Byzantine quorum system as follows. 
Definition 12. A set system $\mathcal{Q}$ is a non-blocking masking quorum system that
tolerates failure pattern $\mathcal{F}$ if and only if:

$$\forall Q_1, Q_2 \in \mathcal{Q}\ \forall F_1, F_2, F_3, F_4 \in \mathcal{F} : ((Q_1 - F_1) \cap (Q_2 - F_2)) - F_3 \not\subseteq F_4.$$

Note that this definition is similar to the first condition of Definition 2. In fact,
given a non-blocking quorum system $\mathcal{Q}$, the set $\{Q - F : Q \in \mathcal{Q} \wedge F \in \mathcal{F}\}$ is
a Byzantine quorum system as defined by Malkhi and Reiter in [9]. On the other hand,
finding a non-blocking quorum system corresponding
to a particular Byzantine quorum system is not always straightforward.
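For small systems, the condition of Definition 12 can be tested exhaustively. The following sketch is only an illustration: the names `is_non_blocking` and `derived_quorums` are assumptions, the failure pattern is passed as an explicit list of sets, and the running time grows with the fourth power of the number of failure sets.

```python
from itertools import product


def is_non_blocking(quorums, failure_sets):
    """Check that ((Q1 - F1) & (Q2 - F2)) - F3 is never contained in F4."""
    qs = [frozenset(q) for q in quorums]
    fs = [frozenset(f) for f in failure_sets]
    for q1, q2 in product(qs, repeat=2):
        for f1, f2, f3, f4 in product(fs, repeat=4):
            if ((q1 - f1) & (q2 - f2)) - f3 <= f4:
                return False
    return True


def derived_quorums(quorums, failure_sets):
    """Return the set {Q - F} induced by a non-blocking quorum system."""
    return {frozenset(q) - frozenset(f) for q in quorums for f in failure_sets}


# Hypothetical example: the single quorum V with one possible faulty server.
singles5 = [frozenset([s]) for s in range(5)]
singles4 = [frozenset([s]) for s in range(4)]
print(is_non_blocking([range(5)], singles5))   # five servers
print(is_non_blocking([range(4)], singles4))   # four servers
```

With five servers the trivial system {V} passes the check, while with four servers three removals can leave a single server that is itself a failure set.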

We define partial access sets of a non-blocking Byzantine quorum system as 
follows. 

Definition 13. The partial access sets of a non-blocking Byzantine quorum system
$\mathcal{Q}$ that tolerates failure pattern $\mathcal{F}$ are the sets of the form $Q - F$, where
$Q \in \mathcal{Q}$ and $F \in \mathcal{F}$.

Our definition requires that the client be able to determine a correct response 
from any partial access set; to have successful partial accesses, it is enough to 
guarantee that the quorum system can handle the worst-case failure scenario. 
To access a quorum Q, a client sends requests to all servers in Q, and then waits 
for a response from all servers in a partial access set of Q. Such a response is 
guaranteed by the definition of non-blocking Byzantine quorum systems. Once 
a response from a partial access set is received, the client can proceed as in [9]. 
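The waiting rule described above amounts to a simple test on the set of servers that have answered so far. The sketch below shows that test only; the function name `covers_partial_access_set` is an assumption, and message handling, timeouts, and the subsequent processing of responses as in [9] are outside its scope.

```python
def covers_partial_access_set(quorum, failure_sets, responders):
    """Return True once the responders include Q - F for some failure set F."""
    q = frozenset(quorum)
    answered = frozenset(responders)
    return any(q - frozenset(f) <= answered for f in failure_sets)


# Hypothetical example: quorum {0, ..., 4}, at most one faulty server.
quorum = range(5)
failures = [frozenset([s]) for s in quorum]
print(covers_partial_access_set(quorum, failures, {0, 1, 2}))
print(covers_partial_access_set(quorum, failures, {0, 1, 2, 3}))
```

A client would evaluate this predicate each time a new response arrives and proceed as soon as it holds.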

The following theorem gives a sufficient condition for a collection of sets to 
be a non-blocking Byzantine quorum system. 

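Theorem 14. A set $\mathcal{Q}$ is a non-blocking quorum system that tolerates failure
pattern $\mathcal{F}$ if:

1. $\forall Q_1, Q_2 \in \mathcal{Q}\ \forall F_1, F_2 \in \mathcal{F} : (Q_1 \cap Q_2) - F_1 \not\subseteq F_2$, and

2. $\forall Q_1 \in \mathcal{Q}\ \forall F_1, F_2 \in \mathcal{F}\ \exists Q_2 \in \mathcal{Q} : Q_2 \subseteq (Q_1 - F_1) \cup F_2$.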

Proof. The proof is by contradiction. Let $Q_1$ and $Q_2$ be two quorum sets such
that $\exists F_1, F_2, F_3, F_4 \in \mathcal{F} : ((Q_1 - F_1) \cap (Q_2 - F_2)) - F_3 \subseteq F_4$. It follows that
$(Q_1 - F_1) \cap (Q_2 - F_2) \subseteq F_3 \cup F_4$ and $((Q_1 - F_1) \cup F_3) \cap ((Q_2 - F_2) \cup F_4) \subseteq F_3 \cup F_4$.
By Condition 2 of the theorem, there exist two quorum sets $Q \subseteq (Q_1 - F_1) \cup F_3$
and $Q' \subseteq (Q_2 - F_2) \cup F_4$, so that $Q \cap Q' \subseteq F_3 \cup F_4$. This contradicts Condition 1 of the theorem.

As we saw, to each non-blocking Byzantine quorum system corresponds a
Byzantine quorum system as defined in [9]. The importance of the reformulation
lies in the fact that it ties the quorum size to the asynchronous access cost.
When designing a non-blocking quorum system, one would have to design a
quorum system with small quorum sets (all other parameters being equal),
which means that the resulting asynchronous access cost is small. This is due to
the fact that the quorums of a non-blocking quorum system are access sets of
the underlying Byzantine quorum system. On the other hand, when designing
Byzantine quorum systems as defined in the formulation of [9], the asynchronous
cost of access is not directly related to the quorum size even if the quorum
systems have good performance parameters. One might argue that it is always 
possible to construct Byzantine quorum systems with low access cost and without 
the need for a reformulation of the definition. Intuitively, we believe that this 
is not true unless the access cost is an explicit design parameter, in which case 
one has to use a definition similar to ours. Also, we believe that our definition 
provides a natural expression of the access cost as a design parameter. Finally, 
previous Byzantine quorum constructions in the literature did not take the access 
cost into account as we saw in Section 5. 

6.1 Existence of Non-blocking Quorum Systems 

Given a failure pattern, we are interested in deciding whether there exists a 
quorum system that tolerates the failure pattern. The following two propositions 
give necessary and sufficient conditions. 

Proposition 15. There exists a non-blocking quorum system that tolerates failure
pattern $\mathcal{F}$ if and only if $\mathcal{Q} = \{V\}$ tolerates $\mathcal{F}$.

Proof. If $\mathcal{Q} = \{V\}$ tolerates $\mathcal{F}$, then there exists a quorum system that tolerates
$\mathcal{F}$. If there exists a quorum system $\mathcal{Q}$ that tolerates $\mathcal{F}$, then there exists a
quorum in $\mathcal{Q}$ that cannot be contained in the union of fewer than five elements
of $\mathcal{F}$. It follows that $V$ is not equal to the union of fewer than five elements of $\mathcal{F}$
and that $\{V\}$ tolerates $\mathcal{F}$.

Proposition 16. There exists a non-blocking quorum system that tolerates failure
pattern $\mathcal{F}$ if and only if

$$\forall A, B, C, D \in \mathcal{F} : V \neq A \cup B \cup C \cup D.$$

Proof. Direct application of Proposition 15. 

It is interesting to note that the necessary and sufficient condition for the 
existence of a non-blocking Byzantine quorum system is also a necessary and 
sufficient condition for the existence of a quorum system [9]. This is expected, 
given the extremal nature of the system used in the proofs. 

Corollary 17. There exists a quorum system that tolerates the $f$-threshold failure
pattern if and only if $n \ge 4f + 1$.
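The existence condition is easy to test on small instances. The sketch below checks that the set of all servers is not the union of four failure sets and applies it to threshold patterns. The names `non_blocking_system_exists` and `threshold_pattern` are assumptions, and only the maximal failure sets (those of size exactly $f$) are enumerated, since smaller sets cannot cover more.

```python
from itertools import combinations, combinations_with_replacement


def non_blocking_system_exists(servers, failure_sets):
    """Return True iff the server set is not the union of four failure sets."""
    universe = frozenset(servers)
    sets = [frozenset(f) for f in failure_sets]
    return all((a | b | c | d) != universe
               for a, b, c, d in combinations_with_replacement(sets, 4))


def threshold_pattern(n, f):
    """Maximal failure sets of the f-threshold pattern on servers 0..n-1."""
    return [frozenset(c) for c in combinations(range(n), f)]


for n, f in [(4, 1), (5, 1), (8, 2), (9, 2)]:
    print(n, f, non_blocking_system_exists(range(n), threshold_pattern(n, f)))
```

The outcomes for these four cases agree with the bound of the corollary: a system exists exactly when $n \ge 4f + 1$.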

7 Non-blocking Quorum Systems Constructions 

Depending on the failure pattern, constructing a non-trivial non-blocking
Byzantine quorum system can be harder than constructing a Byzantine quorum
system (the set $\mathcal{Q} = \{V\}$ is a trivial system if a non-blocking quorum system
exists). In this section we present two constructions of non-blocking Byzantine
quorum systems, one for the threshold failure pattern and one for the disjoint 
failure pattern. 
7.1 Threshold Failure Pattern

Consider a system of $n = d^2$ processors arranged in a $d \times d$ square grid, $d > 4$,
in the presence of the $f$-threshold failure pattern, $f \le (n - 1)/4$. Two vertices
$(x_1, y_1)$ and $(x_2, y_2)$ of the grid are connected if: $x_1 = x_2 \wedge y_1 = y_2 + 1$,
$x_1 = x_2 \wedge y_2 = y_1 + 1$, $x_1 = x_2 + 1 \wedge y_1 = y_2$, $x_2 = x_1 + 1 \wedge y_1 = y_2$,
$x_1 = x_2 + 1 \wedge y_1 = y_2 + 1$, or $x_2 = x_1 + 1 \wedge y_2 = y_1 + 1$.
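The connectivity rule can be written as a small predicate. The sketch below assumes the six-neighbour reading of the rule (two horizontal, two vertical, and the two diagonal offsets $(+1,+1)$ and $(-1,-1)$) and zero-based coordinates, which are conventions chosen for this illustration. The names `connected` and `neighbors` are hypothetical.

```python
# Offsets (x1 - x2, y1 - y2) under which two grid vertices are connected,
# assuming the six-neighbour triangular lattice described above.
OFFSETS = {(0, 1), (0, -1), (1, 0), (-1, 0), (1, 1), (-1, -1)}


def connected(v1, v2):
    """Return True iff the two vertices are adjacent in the lattice."""
    (x1, y1), (x2, y2) = v1, v2
    return (x1 - x2, y1 - y2) in OFFSETS


def neighbors(v, d):
    """List the neighbours of v that lie inside the d x d grid (0-based)."""
    x, y = v
    return [(x + dx, y + dy) for dx, dy in sorted(OFFSETS)
            if 0 <= x + dx < d and 0 <= y + dy < d]


print(connected((0, 0), (1, 1)), connected((0, 1), (1, 0)))
print(neighbors((2, 2), 5))
```

An interior vertex has six neighbours, while vertices on the border of the grid have fewer.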

Similar to the Paths system [10], we define a non-blocking quorum system
$Q_{fn}$ that consists of $2\lceil\sqrt{f+1}\rceil$ disjoint left-right paths and $2\lceil\sqrt{f+1}\rceil$ disjoint
top-bottom paths. Using the same arguments as in [10], it is easy to show that
any two quorums are guaranteed to intersect in $4f + 1$ elements. It follows that
the quorum system is a non-blocking quorum system. The quorum system $Q_{fn}$
has better fault tolerance than the Paths system, which can tolerate no more
than $\sqrt{n}$ failures in the worst case. The reason is that the Paths system requires that
each row have a number of available vertices for the system to be available. In
contrast, $Q_{fn}$ can function as long as the intersection property is satisfied and
regardless of the availability in each row.

More importantly, the cost of accessing $Q_{fn}$ is $4\lceil\sqrt{f+1}\rceil d$, compared to
$\Omega((f + \sqrt{f+1})\,d)$ for the Paths system.

Finally, the Byzantine quorum system $Q_f = \{Q - F : Q \in Q_{fn} \text{ and } F \in \mathcal{F}\}$
is not directly related to the Paths system. In fact, many quorums of $Q_f$ do not
contain a quorum of the Paths system and vice versa. This can be easily shown
by considering the discussion on the fault tolerance of the two systems and the
fact that the quorums of the Paths system are smaller than those of $Q_f$. This
example shows the advantage of using the definition of non-blocking quorum
systems.



7.2 Disjoint Failure Pattern 



In this section we provide an efficient construction of a non-blocking quorum 
system Qdn for a failure pattern F whose elements are disjoint. The construction 
is similar to the construction given in [3]. We present the construction in some 
detail to show that our definition of non-blocking Byzantine quorum systems 
yields quorum systems that are not easily designed for the original definition. 
In fact, the Byzantine quorum system Qd defined by the partial access sets of 
Qdn outperforms the ones proposed in [9] for the disjoint failure pattern. Also, 
we do not know how to express Qd other than as the set of partial access sets of 
$Q_{dn}$. We do not provide proofs for most of our claims because they are almost
identical to those of [3].

Let $\mathcal{F} = \{F_1, F_2, \ldots, F_m\}$ be the set of failure sets ordered in decreasing size.
We assume without loss of generality that $m > 4$. Let $a = n - |\bigcup_{F_i \in \mathcal{F}} F_i|$.

Our construction will proceed as follows. First we show that there are five
disjoint sets of size greater than ^ such that no two of them have a non-empty
intersection with the same faulty set. Then, on each of the five sets $S_i$,
$i = 0, \ldots, 4$, we construct a traditional quorum system whose load is $O(1/\sqrt{|S_i|})$.
On $V$ we construct a quorum system whose elements consist of the union of five
quorums, one from each of the five sets.

Let $m_0 = 4$ and define $m_i$, $i = 1, \ldots, 4$, as follows:

mi = min{j : |Fi U [J Ffe| > ^} 

rrii-i<k<j-\-l 

Note that $j$ and $k$ are bound variables in the definition of $m_i$. Also, the $m_i$, $1 \le i \le 4$,
are always guaranteed to exist.

Now, define the five sets $S_i$, $0 \le i \le 4$, as follows:

- Si =Fi U -P’fc, if * < 4, and 

'^4 Um4</c 

Proposition 18. $S_i \cap S_j = \emptyset$ for $0 \le i < j \le 4$.



Proposition 19. Si > ^for 0 < i < 4. 



Now, we describe the non-blocking quorum system $Q_{dn}$. On $S_i$, $0 \le i \le 4$,
define a quorum system with load $O(1/\sqrt{|S_i|})$. Many such systems exist. One such
system is the triangle lattice system [2]. In the triangle lattice system over $S_i$,
we can choose quorums such that each processor belongs to exactly two
quorums. If we choose each of these quorums with the same probability $p$, it follows that
the load of the triangle lattice system on $S_i$ is at most $2p$. Define a
quorum on $V$ to be the union of five quorum sets, one from each of the triangle
lattices defined on $S_i$, $0 \le i \le 4$.



Proposition 20. The resulting system is a non-blocking quorum system that
tolerates $\mathcal{F}$.



Proof. In fact, any two quorums intersect in five servers, no two of which belong 
to the same faulty set. Therefore the intersection of two quorums does not belong 
to the union of four faulty sets. 

Let $Q_d$ be the Byzantine quorum system defined by the partial access sets of $Q_{dn}$:
$Q_d = \{Q - F : Q \in Q_{dn} \text{ and } F \in \mathcal{F}\}$. It follows that $Q_d$ tolerates the disjoint
failure pattern $\mathcal{F}$.

Proposition 21. The load of the quorum system Qd is O(^). 



Lemma 22. The load of Qd is optimal. 

Proof. The proof uses the same techniques as those used in [3] and is omitted. 

The load of $Q_d$ stated above is the traditional load. Nevertheless, the load of
$Q_{dn}$ is of the same order as the load of $Q_d$ and is therefore optimal. Finally, we
know of no simpler way to express $Q_d$ or to obtain a Byzantine quorum system
that tolerates $\mathcal{F}$ and has a better load.
8 Conclusion

An important contribution of this paper is the recognition that in asynchronous
systems, it is not enough to design a quorum system with small quorum sets;
it is more important to design a quorum set that is amenable to efficient
access. We proposed a new definition of Byzantine quorum systems that lend
themselves to non-blocking access. We have shown that designing non-blocking
Byzantine quorums with small access cost is possible in asynchronous systems.

One might think that the only difference between Byzantine quorum systems 
and non-blocking Byzantine quorum systems is that of how access is achieved, 
and that it is possible to come up with a new definition of access for Byzan- 
tine quorum systems to achieve non-blocking access. While this is possible, the 
resulting non-blocking quorum system is not guaranteed to have a small access 
cost. To design a Byzantine quorum system with a small access cost, we believe
that a formulation similar to ours is needed.

Finally, it is interesting to study incremental access strategies and the corre- 
sponding cost of access. We leave that as a subject for future research. 
Abstract. This paper introduces a new adversary model for Byzantine
agreement and broadcast among a set P of players in which the adversary 
may perform two different types of player corruption: active (Byzantine) 
corruption and fail-corruption (crash). As a strict generalization of the 
results of Garay and Perry, who proved tight bounds on the maximal 
number of actively and fail-corrupted players, the adversary’s capability 
is characterized by a set $Z$ of pairs $(A, F)$ of subsets of $P$ where the
adversary may select an arbitrary such pair $(A_i, F_i)$ from $Z$ and corrupt
the players in $A_i$ actively and fail-corrupt the players in $F_i$.

For this model we prove that the exact condition on Z for which perfectly 
secure agreement and broadcast are achievable is that for no three pairs 
$(A_i, F_i)$, $(A_j, F_j)$, and $(A_k, F_k)$ in $Z$ we have $A_i \cup A_j \cup A_k \cup (F_i \cap F_j \cap F_k) = P$.
Achievability is demonstrated by efficient protocols. Moreover, for a 
slightly stronger condition on Z, which covers the previous mixed (active 
and fail-corruption) threshold condition and the previous purely-active 
non-threshold condition, we demonstrate agreement and broadcast protocols
that are substantially more efficient than all previous protocols for
these two settings. 

Keywords: Broadcast, Byzantine agreement, unconditional security, active
adversary, fail-corruption.



1 Introduction 

Byzantine agreement and broadcast are two closely related fundamental prob- 
lems in distributed systems and cryptography, in particular in secure multi-party 
computation. In this paper we consider Byzantine agreement (and broadcast) 
protocols among a set of players in a standard model with a complete syn- 
chronous network of pairwise authenticated channels among the players. 

1.1 Player Corruption 

We demand the protocols to be perfectly secure (i.e. unconditionally secure with 
no probability of error) against an adversary that may corrupt players in two 
different ways: 
Active corruption: The adversary takes full control over the corrupted players
and makes them deviate from the protocol in an arbitrary way. 
Fail-corruption: At an arbitrary time during the protocol, chosen by the ad- 
versary, the communication from and to the corrupted player is stopped. 

The players that are fail-corrupted or uncorrupted are called non-malicious 
since they do not deviate from the protocol as long as they participate. A player 
is called correct if he is non-malicious and has not failed yet at the described 
point of time. Thus correctness describes a temporal property of the players and 
only a player that is correct at the end of the protocol is actually uncorrupted. 

For a fail-corrupted player to fail during some communication round means 
that he is correct up to this point and, during this round, stops communicating 
with at least one correct player.¹ The player sends no messages during any
subsequent round of the protocol. 



1.2 Byzantine Agreement and Broadcast 

A Byzantine agreement protocol is defined for a set of n players with every player 
initially holding an input value and finally deciding on an output value such that 
the following conditions are satisfied: 

Agreement: All uncorrupted players decide on the same output value. 
Validity: If all initially correct players hold the same input value v then all 
uncorrupted players decide on v. 

Termination: For all non-malicious players the protocol terminates after a fi- 
nite number of rounds. 

In contrast to agreement, broadcast is defined with respect to one particular 
player called the dealer who initially inputs a value. Again, every player decides 
on some output value. For broadcast the former agreement and termination 
conditions are still required. The validity condition transforms into 

Validity': If the dealer is uncorrupted then all uncorrupted players decide on
the dealer’s input value. 

Note that it suffices to focus on bit-agreement (or bit-broadcast) protocols
where the domain of values is restricted to $\{0, 1\}$ since protocols for any finite
domain with cardinality $m$ can be easily obtained by applying $\lceil \log m \rceil$ bit-protocols
in parallel. This does not change the round complexity and increases
the communication complexity only by a factor $\lceil \log m \rceil$.
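The reduction from a domain of size $m$ to bit protocols can be sketched as an encoding step followed by one bit agreement per position. In the sketch below, all function names are assumptions, logarithms are taken to base 2, and `majority` is only a stand-in for a real bit-agreement protocol so that the example runs. It is not a fault-tolerant protocol.

```python
def bit_width(m):
    """Number of bits needed for a domain of size m, i.e. ceil(log2 m)."""
    if m < 1:
        raise ValueError("domain size must be positive")
    return (m - 1).bit_length()


def to_bits(value, m):
    """Encode a value from {0, ..., m-1} as a list of bits, lowest bit first."""
    if not 0 <= value < m:
        raise ValueError("value outside the domain")
    return [(value >> i) & 1 for i in range(bit_width(m))]


def from_bits(bits):
    """Decode a list of bits (lowest bit first) back into an integer."""
    return sum(bit << i for i, bit in enumerate(bits))


def majority(bits):
    """Stand-in for one bit-agreement instance (assumption, for illustration)."""
    return int(2 * sum(bits) > len(bits))


def agree_bitwise(inputs, m, bit_agreement):
    """Run one bit agreement per bit position and reassemble the result."""
    columns = zip(*(to_bits(x, m) for x in inputs))
    return from_bits([bit_agreement(list(column)) for column in columns])


print(agree_bitwise([5, 5, 5, 5], 8, majority))
```

When all inputs are equal, every bit instance sees identical inputs, so the reassembled value equals the common input, which mirrors how validity carries over from the bit protocols.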

¹ Our model includes that in the round during which a player fails, some correct players
may still receive a valid message by this player whereas others may not. Hence the 
correct players’ views about which players have failed can be inconsistent, and this 
must be taken into account in the design and analysis of the protocols. 
1.3 Previous Work

In the threshold model with an active adversary, Lamport, Shostak and 
Pease [PSL80,LSP82] proved that Byzantine agreement is achievable if and only 
if less than one third of the players are actively corrupted (t < n/3). For this 
model numerous protocols with optimal resilience have been proposed in the lit- 
erature [DFF+82,BDDS87,TPS87,FM88,BGP89,CW92,GM93], which all have 
communication and computation complexities polynomial in the number n of 
players. In the threshold model with an adversary that may only perform fail- 
corruptions, Lamport and Fischer [LF82] proved that agreement is achievable 
for any t < n. 

These results have been unified in [GP92] for the threshold model where an
adversary is considered who may corrupt arbitrary $t$ players, but at most $b \le t$
of them actively (the rest are only fail-corrupted). They proved that $t + 2b < n$ is a
tight bound for agreement to be achievable and proposed protocols with optimal
resilience and polynomial complexities.

In the more general context of secure multi-party computation, Hirt and Mau- 
rer [HM97] introduced the concept of a general adversary that is characterized 
by an adversary structure which is a set of subsets of the player set. The ad- 
versary may corrupt the players of exactly one of these subsets. For the same 
model, with respect to an active adversary, Fitzi and Maurer [FM98] proposed 
optimally resilient broadcast protocols with computation complexity polynomial 
in the size of the adversary structure and communication complexity polynomial 
in the number n of players. 



1.4 Contributions 

This paper unifies the models of [GP92] and [FM98] to the new model with a 
general adversary that may simultaneously corrupt some players actively and 
some other players to fail. For this model a tight condition on the adversary 
structure is proven for Byzantine agreement to be achievable. Efficient protocols 
are proposed for every structure that meets this condition. This condition for 
example guarantees that, quite surprisingly, agreement is possible among four 
players $p_1$, $p_2$, $p_3$, and $p_4$ if any player $p_i$ is actively corrupted and all the
remaining players except $p_{((i+1) \bmod 4)}$ are fail-corrupted.

Furthermore we present a protocol that, when restricting this model to the 
special cases of [GP92] and [FM98], is even more efficient than any protocol 
previously known for these special cases. 

Although all these results (tight condition and protocols) are only presented 
for agreement they immediately hold for broadcast as well since, with only minor 
modifications, a broadcast protocol can be easily obtained from any agreement 
protocol and vice versa. Our proposed agreement protocols can even be turned 
into a broadcast protocol with no loss of efficiency. 
1.5 Definitions and Notation

The player set is denoted by $P = \{p_1, \ldots, p_n\}$ and its cardinality by $n = |P|$. The
adversary is defined by a general adversary structure $Z$ which is a set of classes
$(A, F)$ with $A \subseteq P$ and $F \subseteq P$ where the players of exactly one class $(A, F)$ may
be corrupted: actively corrupted for the players in $A$ and corrupted to fail for
the players in $F$. Without loss of generality we demand $A \cap F = \emptyset$ since active
corruption is strictly more general than fail-corruption.

A class $(A', F')$ is contained in a class $(A, F)$ (written $(A', F') \subseteq (A, F)$) if
$A' \subseteq A$ and $F' \subseteq A \cup F$, and a class $(A', F')$ is strictly contained in a class $(A, F)$
if it is contained, and $A' \subset A$ or $F' \subset F$ (written $(A', F') \subset (A, F)$).

The adversary structure $Z$ is defined to be monotone with respect to inclusion,
i.e.,

$$(A, F) \in Z \wedge (A', F') \subseteq (A, F) \implies (A', F') \in Z.$$

The basis $\bar{Z}$ of an adversary structure $Z$ is the set of all maximal elements of $Z$:

$$\bar{Z} = \{(A, F) \in Z \mid \nexists (A', F') \in Z : (A, F) \subset (A', F')\}.$$

A set $A$ is called an active set of an adversary structure $Z$ if $(A, \emptyset) \in Z$.
A set $F$ is called a fail set of $Z$ if $(\emptyset, F) \in Z$. The set of all active sets of an
adversary structure $Z$ is denoted by $Z_A$.

The following predicates $Q(P, Z)$ and $R(P, Z)$ on adversary structures $Z$ with
respect to a player set $P$ will be needed later in this paper:

$$Q(P, Z) := \forall (A_1, F_1), (A_2, F_2), (A_3, F_3) \in Z : A_1 \cup A_2 \cup A_3 \cup F_1 \neq P$$

$$R(P, Z) := \forall (A_1, F_1), (A_2, F_2), (A_3, F_3) \in Z : A_1 \cup A_2 \cup A_3 \cup (F_1 \cap F_2 \cap F_3) \neq P$$

Note that $Q(P, Z)$ implies $R(P, Z)$ and that, because of symmetry, $Q(P, Z)$ is
equivalent to $R(P, Z)$ in the threshold case.
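Both predicates can be evaluated by enumerating triples of classes. The sketch below assumes that the structure is given by a list of its maximal classes, each a pair of frozensets, which suffices because the unions in both predicates can only shrink for contained classes. The function names and the zero-based four-player example (player $i$ actively corrupted, all others except player $(i+1) \bmod 4$ fail-corrupted) are choices made for this illustration.

```python
from itertools import product


def predicate_q(players, classes):
    """Q(P, Z): no three classes satisfy A1 | A2 | A3 | F1 == P."""
    p = frozenset(players)
    return all((a1 | a2 | a3 | f1) != p
               for (a1, f1), (a2, _f2), (a3, _f3) in product(classes, repeat=3))


def predicate_r(players, classes):
    """R(P, Z): no three classes satisfy A1 | A2 | A3 | (F1 & F2 & F3) == P."""
    p = frozenset(players)
    return all((a1 | a2 | a3 | (f1 & f2 & f3)) != p
               for (a1, f1), (a2, f2), (a3, f3) in product(classes, repeat=3))


# Hypothetical four-player structure, players numbered 0..3.
four = frozenset(range(4))
example_classes = [(frozenset([i]), four - {i, (i + 1) % 4}) for i in range(4)]
print(predicate_q(four, example_classes), predicate_r(four, example_classes))
```

For this structure the weaker predicate holds while the stronger one fails, which illustrates why the two conditions differ outside the threshold case.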

2 Necessary Condition 

Given the tight bound of $t + 2b < n$ for the threshold model² in [GP92] it might seem
natural to conclude that $Q(P, Z)$ is a necessary condition for the general model,
i.e., for any three classes in $Z$ the union of all active sets with one of the fail sets
must not cover the full player set $P$. However, as a consequence of the (generally)
asymmetric properties of general adversary structures, agreement can still be
achievable when this bound is violated since the failure of one particular player
may rule out certain classes of the structure to be selected by the adversary.
Thus only a weaker condition can be proven to be necessary, which we prove to
be tight in the next section: $R(P, Z)$, i.e., for any three classes in $Z$ the union of
all active sets with the intersection of all fail sets must not cover the full player
set $P$.

² i.e., three times the number of actively corrupted plus once the number of fail-corrupted
players must be less than $n$
Theorem 1. For a set $P$ of $n$ players and any adversary structure $Z$, $R(P, Z)$
is necessary for Byzantine agreement to be achievable.

Proof. For the sake of contradiction suppose that there is an agreement protocol
for some adversary structure $Z$ violating $R(P, Z)$. Hence there are three classes
$(A_1, F_1), (A_2, F_2), (A_3, F_3) \in Z$ such that $A_1 \cup A_2 \cup A_3 \cup (F_1 \cap F_2 \cap F_3) = P$.
Due to the monotonicity of the adversary structure we can assume without loss
of generality that $F_1 = F_2 = F_3 =: F$ and that the sets $A_1$, $A_2$, $A_3$ and $F$ are
pairwise disjoint. One possible strategy for the adversary is to make all players
in $F$ fail at the beginning of the protocol. Hence this protocol can be easily
modified into a secure agreement protocol for the player set $P' = P \setminus F$ with
respect to $Z$ restricted to the players in $P'$ since by assumption this protocol is
correct even if no player in $F$ ever sends any message. Since $A_1 \cup A_2 \cup A_3 = P'$
this contradicts the result of [PSL80,HM97] that agreement is impossible in this
case.

3 Optimally Resilient Protocols 

This section describes protocols for any player set P and adversary structure Z 
satisfying R{P,Z). 



3.1 Protocol Elements and Code Notation 

The protocols are constructed basically along the lines of the protocols of [GP92] 
which are based on several subprotocols. The main idea of these subprotocols is 
that every player enters them with his preferred value he inclines to decide on 
and exits them with an updated (potentially different) preferred value such that 
the following two conditions are satisfied:

Persistence: If all correct players enter the subprotocol with the same preferred
value then, after the execution of the subprotocol, all correct players³
still prefer this value. In other words, the subprotocol has no effect when
agreement had previously been achieved.

Consistence: In any case (also if the correct players enter the subprotocol with 
distinct preferred values) the values preferred by the correct players at the 
end of the subprotocol are consistent (in a way to be defined separately for 
each particular subprotocol). 

The effect of such subprotocols can be interpreted as getting the correct play- 
ers closer to a state of agreement whereas, once achieved, agreement cannot be 
reversed anymore by the corrupted players. In this paper, often a weaker form of 
consistence is used that depends on whether some fail-corrupted (i.e. non-actively
corrupted) player fails during the execution of the corresponding subprotocol.

³ i.e., all players who are still correct (recall the temporal definition of correctness).
Conditional consistence: Consistence provided that no fail-corrupted player
fails during the execution of the subprotocol in consideration. 

All pseudocode descriptions of protocols are stated with respect to the local
view of one particular player. The complete protocols consist of all players
executing their local codes in parallel. Variables that have no subscript (e.g. $v$)
are stated with respect to an arbitrary player and variables with a subscript $p$
(e.g. $v_p$) denote the corresponding variable of the particular player $p$. For every
player $p$, two global⁴ variables are used throughout all subprotocols, $v_p$ and $L_p$.
$v_p$ denotes the value preferred by player $p$. $L_p$ is a set in which player $p$ collects
all players that he has detected to be corrupted (active or fail). $L$ is initialized
to the empty set and (for a correct player) will never contain any correct player.



3.2 Value Unification 

This section describes the crucial subprotocol of the agreement protocol: 
MakeUnique. It satisfies the persistence property according to Section 3.1 and 
conditional consistence in a way that at the end no two correct players prefer 
distinct values in {0, 1} if no fail-corrupted player fails during the execution of 
the subprotocol. In order to achieve this the original bit domain is extended 
by an invalidity value 2. However, the preferred value v is still required to be 
in {0, 1} before the execution of MakeUnique. 

MakeUnique:

1. SendToAll(v); // i.e. send v to every other player
2. L := L ∪ { r ∈ P | no value received from r or value outside {0, 1} };
3. C^0 := { r ∈ P | r sent 0 } \ L;
4. C^1 := { r ∈ P | r sent 1 } \ L;
5. if (C^1, L) ∈ Z then v := 0
6. elseif (C^0, L) ∈ Z then v := 1
7. else v := 2
8. fi;

If, at the end of MakeUnique, a player $p$ holds some value $v_p \in \{0,1\}$, we
say that player $p$ accepts value $v_p$. On the other hand, $v_p = 2$ means that
player $p$ rejects any value from $\{0,1\}$. More precisely, a correct player
accepts a value $v \in \{0,1\}$ exactly if, according to his view, agreement on
$v$ could have been achieved before the execution of MakeUnique. For
$v \in \{0,1\}$ we define $\bar{v}$ as

$$\bar{v} := 1 - v.$$
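The local part of MakeUnique can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the adversary structure is passed as a hypothetical membership oracle `in_Z(A, F)`, the dictionary `received` is assumed to also hold the player's own value, and `threshold_in_Z` is a toy structure invented for the example, not one taken from the paper.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Optional, Set, Tuple

# Hypothetical membership oracle (assumption): in_Z(A, F) is True iff the
# class (A, F) is an element of the adversary structure Z.
InZ = Callable[[FrozenSet[int], FrozenSet[int]], bool]


def make_unique_local(
    players: Iterable[int],
    received: Dict[int, Optional[int]],
    L: Set[int],
    in_Z: InZ,
) -> Tuple[int, Set[int]]:
    """Local steps 2 to 7 of MakeUnique for one player, after SendToAll(v)."""
    players = list(players)
    # Step 2: detect every player that sent nothing or a value outside {0, 1}.
    L = set(L) | {r for r in players if received.get(r) not in (0, 1)}
    # Steps 3 and 4: senders of 0 and of 1, excluding detected players.
    C0 = frozenset(r for r in players if received.get(r) == 0 and r not in L)
    C1 = frozenset(r for r in players if received.get(r) == 1 and r not in L)
    # Steps 5 to 7: accept 0 if all senders of 1 may be actively corrupted,
    # accept 1 if all senders of 0 may be, and reject (value 2) otherwise.
    if in_Z(C1, frozenset(L)):
        return 0, L
    if in_Z(C0, frozenset(L)):
        return 1, L
    return 2, L


def threshold_in_Z(A, F, ta=1, tf=1):
    """Toy structure (assumption): at most ta active and tf fail corruptions."""
    return len(A) <= ta and len(F) <= tf


if __name__ == "__main__":
    P = range(1, 6)
    print(make_unique_local(P, {1: 0, 2: 0, 3: 0, 4: 0, 5: 1}, set(), threshold_in_Z))
```

With five players and one dissenting sender the player accepts the majority value; with a two-against-two split plus one invalid message he rejects by returning 2.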

Lemma 1 (Persistence of MakeUnique). If all correct players initially hold
the same value $v \in \{0,1\}$, then after the execution of MakeUnique every
correct player $p$ holds the value $v_p = v$.

^ Global with respect to p's scope.
Proof. Let $p$ be a player who is correct at the end of MakeUnique. Since all
correct players initially hold the same value $v$, every such player either
sends this value to $p$ or fails during this communication round. Hence
$\bar{v}$ will only be received from an actively corrupted player, and
$(C_p^{\bar v}, L_p) \in Z$ holds. Hence also $(C_p^{v}, L_p) \notin Z$ must
hold since otherwise $R(P, Z)$ would be violated (because
$P = C_p^{v} \cup C_p^{\bar v} \cup (L_p \cap L_p)$). Thus $v_p = v$ after the
execution of MakeUnique.



Lemma 2 (Conditional Consistence of MakeUnique). If after the execution
of MakeUnique, two correct players $p$ and $q$ hold values $v_p \neq 2$ and
$v_q \neq 2$, respectively, then either $v_p = v_q$ or at least one
fail-corrupted player failed during the execution of MakeUnique.



Proof. Suppose for the sake of contradiction that no fail-corrupted player fails
between sending his value to the players $p$ and $q$ and that $v_p = v$ and
$v_q = \bar{v}$ for some $v \in \{0,1\}$, and hence $(C_p^{\bar v}, L_p) \in Z$
and $(C_q^{v}, L_q) \in Z$. We have $P = C_p^{v} \cup C_p^{\bar v} \cup L_p$ and
hence $P$ can be decomposed as

$$C_p^{\bar v} \cup \underbrace{(C_p^{v} \cap C_q^{v})}_{\subseteq C_q^{v}} \cup \underbrace{(C_p^{v} \setminus C_q^{v}) \cup (L_p \setminus L_q)}_{=:A} \cup (L_p \cap L_q) = P. \qquad (1)$$

Since no fail-corrupted player failed during the execution of this protocol, all
players in $A = (C_p^{v} \setminus C_q^{v}) \cup (L_p \setminus L_q)$ must be
actively corrupted and hence $(A, L_p \cap L_q) \in Z$ must hold. Thus we have
$(C_p^{\bar v}, L_p) \in Z$, $(C_q^{v}, L_q) \in Z$ and
$(A, L_p \cap L_q) \in Z$, and by Equation (1)
$C_p^{\bar v} \cup C_q^{v} \cup A \cup (L_p \cap L_q) = P$, which
contradicts $R(P, Z)$.



3.3 Agreement Protocol 



The agreement protocol consists of a loop over a sequence of statements where
one single iteration of the loop can be interpreted in the following way:

The players run the MakeUnique protocol in order to guarantee that no two
correct players continue with distinct values in $\{0,1\}$. In the next round
all players report their (unified) values to every other player. Then every
player accepts a value $v \in \{0,1\}$ if, according to his view, at least one
correct player reported $v$; otherwise he rejects by setting $v := 2$. Finally
some distinguished player, called the king, reports his particular value $v$,
which is adopted exactly by those players who know that at least one correct
player rejected after MakeUnique (which implies that agreement did not hold
before this particular iteration of the loop).
Agreement (VAR v) (Agreement Protocol 1):

1. $L := \emptyset$;

2. for $i := 1$ to $\lceil \log n \rceil n$ do

3. $k := ((i - 1) \bmod n) + 1$; // Assign king

4. MakeUnique; // Communication Phase 1

5. SendToAll($v$); // Communication Phase 2

6. $L := L \cup \{\, r \in P \mid \text{no value received from } r \text{ or value outside } \{0,1,2\} \,\}$;

7. $D^j := \{\, r \in P \mid r \text{ sent } j \,\} \setminus L$ for $j \in \{0,1,2\}$;

8. if $(D^0, L) \notin Z$ then $v := 0$

9. elseif $(D^1, L) \notin Z$ then $v := 1$

10. else $v := 2$

11. fi;

12. $p_k$ (only): SendToAll($v$); // Communication Phase 3

13. $w :=$ value received from $p_k$; // if no value is received then set $w :=$ …

14. if $(D^2, L) \notin Z$ then $v := \min(1, w)$ fi;

15. od;

Every single iteration of the for-loop can be seen as a subprotocol with per- 
sistence and conditional consistence properties according to Section 3.1. These 
properties are stated in the next lemmas. 
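As an illustration, the local computation that one player performs after Communication Phase 2, and the rule for adopting the king's value, might look as follows in Python. This is a sketch under assumptions: `in_Z` is a hypothetical membership oracle for the adversary structure, `threshold_in_Z` is a toy structure made up for the example, and the king's value `w` is passed in by the caller, who is assumed to have already applied whatever default the protocol prescribes for a silent king.

```python
def phase2_and_3_local(players, received, L, in_Z, w):
    """Lines 6 to 11 and line 14 of Agreement Protocol 1 for one player.

    received maps each sender to the value reported in Phase 2 (a missing
    key means no message arrived); w is the value obtained from the king.
    """
    players = list(players)
    # Line 6: detect silent players and players sending values outside {0,1,2}.
    L = set(L) | {r for r in players if received.get(r) not in (0, 1, 2)}
    # Line 7: partition the remaining senders by reported value.
    D = {
        j: frozenset(r for r in players if received.get(r) == j and r not in L)
        for j in (0, 1, 2)
    }
    FL = frozenset(L)
    # Lines 8 to 11: accept j if at least one correct player must have sent j.
    if not in_Z(D[0], FL):
        v = 0
    elif not in_Z(D[1], FL):
        v = 1
    else:
        v = 2
    # Line 14: adopt the king's value if some correct player must have rejected.
    if not in_Z(D[2], FL):
        v = min(1, w)
    return v, L


def threshold_in_Z(A, F, ta=1, tf=1):
    """Toy structure (assumption): at most ta active and tf fail corruptions."""
    return len(A) <= ta and len(F) <= tf


if __name__ == "__main__":
    P = range(1, 6)
    print(phase2_and_3_local(P, {1: 0, 2: 0, 3: 2, 4: 2, 5: 1}, set(), threshold_in_Z, w=2))
```

When everybody reports the same value, the set of senders of 2 is empty and the king is ignored, which is the behaviour the persistence property relies on.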

Lemma 3 (Conditional Consistence). At the end of any iteration of the
for-loop with a correct king $p_k$ during which no fail-corrupted player fails,
every correct player $p$ holds the same value $v_p = v$.

Proof. Consider some $k$-th iteration of the for-loop with $p_k$ being correct
and during which no fail-corrupted player fails. If all correct players replace
their values $v$ by $\min(1, w)$, we are done since all correct players receive
the same value $w$ from player $p_k$.

Suppose now that at least one correct player $p$ ignores the value sent by the
king since $(D_p^2, L_p) \in Z$ holds. Hence $v_p \neq 2$ since otherwise,
according to Lines 8 to 10 of the protocol, also $(D_p^0, L_p) \in Z$ and
$(D_p^1, L_p) \in Z$ would hold in contradiction to $R(P, Z)$. Let $v := v_p$.
$(D_p^v, L_p) \notin Z$ ($v \neq 2$ and Lines 8 to 10) implies that at least one
correct player sent $v$ during Communication Phase 2 and, due to the
(conditional) consistence of MakeUnique, no correct player sent $\bar{v}$
during the same phase. Hence, for every correct player $q$, $D_q^{\bar v}$ can
only contain actively corrupted players and hence $(D_q^{\bar v}, L_q) \in Z$
holds.

Suppose, for the sake of contradiction, that there is a correct player $q$ who
enters Communication Phase 3 with a value $v_q \neq v$, i.e.,
$(D_q^{v}, L_q) \in Z$. $P = D_p^0 \cup D_p^1 \cup D_p^2 \cup L_p$ can be
decomposed as

$$D_p^{2} \cup \underbrace{(D_p^{v} \setminus D_q^{v})}_{=:A_1} \cup \underbrace{(D_p^{v} \cap D_q^{v})}_{\subseteq D_q^{v}} \cup \underbrace{D_p^{\bar v} \cup (L_p \setminus L_q)}_{=:A_2} \cup (L_p \cap L_q) = P.$$

All players in $A = A_1 \cup A_2$ are actively corrupted since they sent either
$\bar{v}$ or distinct values to the players $p$ and $q$ or failed^ in $p$'s view
but not in $q$'s view. Hence $(A, L_p \cap L_q) \in Z$, which leads to a
contradiction with condition $R(P, Z)$ since together with
$(D_p^2, L_p) \in Z$ and $(D_q^{v}, L_q) \in Z$ we have
$D_p^2 \cup D_q^{v} \cup A \cup (L_p \cap L_q) = P$.

Thus every correct player $q$ enters Communication Phase 3 with
$(D_q^{v}, L_q) \notin Z$ and hence $v_q = v = v_p$. Since especially the king
$p_k$ is correct, every player who accepts $p_k$'s value accepts
$v_{p_k} = v = v_p$.

^ Note that we suppose no fail-corrupted player to fail during this loop.

Lemma 4 (Persistence). If at the beginning of any iteration of the for-loop
every correct player $p$ holds the same value $v_p = v \neq 2$, then every
correct player holds $v$ at the end of the iteration even if some fail-corrupted
players fail.

Proof. Due to the persistence property of MakeUnique (Lemma 1) every correct
player $p$ holds $v_p = v$ after MakeUnique and hence, after SendToAll,
$(D_q^{\bar v}, L_q) \in Z$ and $(D_q^{2}, L_q) \in Z$ for every correct player
$q$. Because of the condition $R(P, Z)$ also $(D_q^{v}, L_q) \notin Z$ must
hold. Thus every correct player $p$ ignores the king in Communication Phase 3
and holds value $v_p = v$ at the end of the loop.

The following theorem together with Theorem 1 shows that the condition 
R{P, Z) is tight: 

Theorem 2. For a set $P$ of $n$ players and an adversary structure $Z$,
perfectly secure Byzantine agreement is achievable if $R(P, Z)$ is satisfied.
For every structure $Z$ satisfying $R(P, Z)$ there is such a protocol with
communication complexity polynomial in $n$ and computation complexity
polynomial in $|Z|$.^

Proof. We first show by contradiction that, in Agreement Protocol 1, all
uncorrupted players finally decide on the same value. Thus assume that two
uncorrupted players decide on distinct values. Then, according to Lemma 3, there
was no iteration of the for-loop with a correct king during which no
fail-corrupted player failed. Let $C(0) = P$ and $C(i)$ denote the set of
players that are still correct at the end of iteration $i$ of the for-loop, and
let $c(0) = n$ and $c(i) = |C(i)|$. We argue that during any $n$ sequential
iterations $i = j, \ldots, j + n - 1$ at least $c(j-1)/2$ fail-corrupted
players failed. The failure of one single fail-corrupted player can prevent
agreement for at most two iterations with a king from the set $C(j-1)$: one
correct king's iteration and his own one^. Hence at least half of the players
in $C(j-1)$ must have failed. Hence, for any $l$ with
$0 < l \le \lceil \log n \rceil$, $c(ln) \le c((l-1)n)/2$ and
$c(ln) \le c(0)/2^{l}$, i.e. for $l = \lceil \log n \rceil$ (after the last
iteration of the for-loop) we have

$$c(\lceil \log n \rceil n) \le c(0)/2^{\lceil \log n \rceil} \le c(0)/n = 1,$$

in contradiction to the fact that at least two players are uncorrupted and hence
$c(\lceil \log n \rceil n) > 1$. Hence there is a first iteration of the
for-loop with a correct king during which no fail-corrupted player fails. After
this iteration agreement holds (Lemma 3) and due to Lemma 4 agreement holds also
at the end of the protocol.

^ Under the natural assumption that there exists an algorithm polynomial in $n$
to decide whether a given class $(A, F)$ is an element of the adversary
structure $Z$, the computation complexity is also polynomial in $n$.

^ After he has already failed during the correct player's iteration.

The validity and termination properties are obviously satisfied. The efficiency 
can be easily verified by code inspection. 
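The loop length and the halving argument used in the proof can be checked numerically with a small sketch. The logarithm is assumed to be base 2 here, which is what the bound $c(0)/2^{\lceil \log n \rceil} \le c(0)/n$ suggests; the function names are illustrative only.

```python
def ceil_log2(n: int) -> int:
    """Smallest integer e with 2**e >= n, for n >= 1."""
    return (n - 1).bit_length()


def king_schedule(n: int):
    """King index k = ((i - 1) mod n) + 1 for i = 1 .. ceil(log n) * n."""
    return [((i - 1) % n) + 1 for i in range(1, ceil_log2(n) * n + 1)]


def correct_player_bounds(n: int):
    """Upper bounds on c(0), c(n), c(2n), ... if every block of n iterations
    halves the number of still-correct players, as assumed in the proof."""
    bounds = [float(n)]
    for _ in range(ceil_log2(n)):
        bounds.append(bounds[-1] / 2)
    return bounds


if __name__ == "__main__":
    print(king_schedule(4))
    print(correct_player_bounds(5))
```

For every n the last bound is at most 1, which is the contradiction with two uncorrupted players that the proof exploits.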



4 Efficiency Improvements by Early Stopping 

A major disadvantage of Agreement Protocol 1 of Section 3 is that the play- 
ers must continue to iterate the for-loop even if agreement on some value has 
already been reached. The goal of this section is to derive a protocol that can 
be terminated as soon as agreement is achieved, i.e., a protocol that terminates 
early if only few players are corrupted. This is achieved by some modifications 
of Agreement Protocol 1. 

However, a full description and correctness proof of our early stopping
protocols for condition $R(P, Z)$ would exceed the limits of this extended
abstract. Instead, we give an early stopping protocol with respect to the
stronger condition $Q(P, Z)$, which can be handled more easily.^ Moreover, these
protocols, when applied to the case of a mixed (active and fail-corruption)
threshold adversary [GP92] or a purely-active non-threshold adversary [FM98],
are even more efficient than any previously known protocols for these special
cases.^



4.1 Protocol Modifications 

As a consequence of the somewhat stronger condition $Q(P, Z)$ on $P$ and $Z$,
explicit failure detection becomes unnecessary (i.e. $L$ drops out of the
algorithms). The main idea is to achieve the following property, which is
important for the correctness of the protocol:

Stop-Implication: A correct player (only) stops early if it is guaranteed that 
every correct player already prefers the same value v and if it is guaranteed that 
even after his early stopping v is preferred by every correct player. 

Since the Stop-Implication is a property of the final agreement protocol its 
correctness is proven only later, in the proof of Lemma 10. In order for the 
subprotocols to still satisfy the persistence property, even if correct players stop 
early, the following rule is introduced. 

Substitution-Rule: Whenever, during any communication round, a player $p$
expects a value $x$ to be sent by a player $q$ but does not receive any value,
then $x$ is set to the value $x_p$ that has been sent by himself during the same
communication round.

^ Note that $Q(P, Z)$ still implies the achievability bounds for [GP92,FM98].

^ In contrast to our early stopping protocols for $R(P, Z)$ (not described in
this extended abstract), which are less efficient than those of [GP92] for their
special model.

This rule together with the Stop-Implication guarantees that, after a correct 
player p has stopped early, every correct player q replaces any future message 
by p correctly as if p would still participate in the protocol. 
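The Substitution-Rule amounts to filling in missing messages with the value the player sent himself in the same round. A minimal Python sketch, with illustrative names that do not come from the paper:

```python
def substitute_missing(expected_senders, received, own_sent):
    """Apply the Substitution-Rule for one communication round.

    expected_senders: players from whom a value is expected in this round
    received:         dict mapping sender -> value actually received
    own_sent:         the value this player sent in the same round
    """
    return {
        q: (received[q] if q in received else own_sent)
        for q in expected_senders
    }


if __name__ == "__main__":
    # Player 2 has stopped early (or failed), so its message is missing.
    print(substitute_missing([1, 2, 3], {1: 0, 3: 1}, own_sent=1))
```

If player 2 stopped early while holding the agreed value, the substituted entry equals what player 2 would have sent, which is the point of combining the rule with the Stop-Implication.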



4.2 Value Unification 

Subprotocol MakeUnique of Section 3.2 can be simplified when condition
$Q(P, Z)$ is satisfied. Since MakeUnique will be applied in two different
contexts, we use the variable parameter $x$ in the following pseudo-code
description.

MakeUnique (VAR x):

1. SendToAll($x$);

2. $C^0 := \{\, r \in P \mid r \text{ sent } 0 \,\}$;

3. $C^1 := \{\, r \in P \mid r \text{ sent } 1 \,\}$;

4. if $C^1 \in Z_A$ then $x := 0$

5. elseif $C^0 \in Z_A$ then $x := 1$

6. else $x := 2$

7. fi;

It is easy to see that together with the Substitution Rule, persistence (accord- 
ing to Lemma 1) is still satisfied. In contrast to the conditional consistence in 
Lemma 2 even unconditional consistence can be proven. 
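A sketch of the simplified MakeUnique, combined with the Substitution-Rule, could read as follows. The assumptions are that `in_ZA` is a hypothetical membership oracle for the active sets of the adversary structure, that a value $v$ is accepted when the senders of $1 - v$ form a set in $Z_A$ (by analogy with Section 3.2), and that the toy oracle in the usage example is made up for illustration.

```python
def make_unique_q(players, received, x, in_ZA):
    """Local part of MakeUnique (VAR x) under condition Q(P, Z).

    Missing messages are replaced by the player's own value x, following the
    Substitution-Rule; no failure-detection set L is kept.
    """
    values = {r: (received[r] if r in received else x) for r in players}
    C0 = frozenset(r for r, val in values.items() if val == 0)
    C1 = frozenset(r for r, val in values.items() if val == 1)
    if in_ZA(C1):   # all senders of 1 may be actively corrupted: accept 0
        return 0
    if in_ZA(C0):   # all senders of 0 may be actively corrupted: accept 1
        return 1
    return 2        # reject


if __name__ == "__main__":
    at_most_one = lambda A: len(A) <= 1   # toy Z_A (assumption)
    print(make_unique_q(range(1, 5), {1: 0, 2: 0, 3: 0, 4: 1}, 0, at_most_one))
```

In the second test case below two messages are missing and are replaced by the player's own value 1, so the result is still 1, which mirrors the persistence claim.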

Lemma 5 (Consistence of MakeUnique). If, after the execution of
MakeUnique, two correct players $p$ and $q$ hold values $v_p \neq 2$ and
$v_q \neq 2$, then $v_p = v_q$.

Proof. Let $p$ and $q$ be two correct players and, for the sake of
contradiction, suppose that $v_p = v \neq 2$ and $v_q = 1 - v = \bar{v}$.

Suppose that no correct player has stopped so far during the protocol and let
$A$ (and $F$) be the sets of players that are actively corrupted (and
fail-corrupted), and hence $C_p^{\bar v} \in Z_A$, $C_q^{v} \in Z_A$ and
$(A, F) \in Z$. Since a correct player sends the same value (in $\{0,1\}$) to
both players $p$ and $q$ we have
$C_p^{\bar v} \cup C_q^{v} \cup A \cup F = P$ in contradiction to $Q(P, Z)$.

On the other hand, if any correct player has stopped the protocol before then, 
due to the Stop-Implication, the players p and q hold the same value Vp = Vq 
after MakeUnique because of the persistence property. 

^ This substitution value is well-defined since communication is symmetric in
the sense that during any specific round all players report on their particular
view of the same variable or fact. The only exception is the king's round,
wherein only the king sends his preferred value. In this case simply the
player's own preferred value is taken.

^ Whenever it is argued that every correct player behaves in a certain way, only
players that have not stopped yet are considered.
4.3 Unicast

In order to enable a player to detect that all correct players prefer the same
value (and will even do so after he stops), Communication Phase 2 of Agreement
Protocol 1 is replaced by the more powerful primitive Unicast. Note that the
for-loop can be parallelized into one communication round.

Unicast (VAR 

1. SendToAll($v$);

2. for $l := 1$ to $n$ do

3. B} value received from 



5. MakeUniqueCS'^); 

6. od; 

7. {pz G P I = 0 A = 0 }; 

8. pi {p; G P I = 1 A = 0 }; 

9. := {pz G P I = 2 A = 1 }; 

10. if $D^0 \notin Z_A$ then $v := 0$

11. elseif $D^1 \notin Z_A$ then $v := 1$

12. else $v := 2$

13. fi;

We say that player $p$ accepts value $v \in \{0,1,2\}$ from player $q$ or, as a
shorthand, that $p$ accepts $(v, q)$ if $q \in D_p^{v}$. The following lemma
follows directly from the consistence and persistence properties of MakeUnique.

Lemma 6 (Consistence of Unicast). The value $v$ sent by a correct player $p$
is accepted by every correct player $q$, i.e., $p \in D_q^{v}$. Moreover, if a
correct player $p$ accepts $(v, r)$ for $v \in \{0,1\}$ and $r \in P$, then no
correct player $q$ accepts $(2, r)$, i.e.,
$D_q^{2} \subseteq (P \setminus D_p^{v})$ for any two correct players $p$ and
$q$.

Lemma 7 (Persistence of Unicast). If all correct players prefer the same
value $v \in \{0,1\}$ before the execution of Unicast, then all players from
which a correct player $p$ does not accept $v$ are actively corrupted. In
particular $(P \setminus D_p^{v}) \in Z_A$ (and hence
$D_p^{\bar v} \in Z_A$), and $v_p = v$ at the end of Unicast.

Proof. According to Lemma 6 the value of every correct player is accepted. It
remains to show that the value $v$ is even accepted from every fail-corrupted
player $p_l$. A correct player either does not receive a value from $p_l$ during
SendToAll and hence replaces this value by $v$, or he still receives a value
from $p_l$, but then this value must be $v$ since $p_l$ was correct until the
execution of Unicast and therefore
preferred v by assumption. Hence P* = 0 for all correct players which persists 
after MakeUnique according to the persistence property. 
4.4 Agreement Protocol

Agreement (VAR v) (Agreement Protocol 2):

1. for $k := 1$ to $n$ do

2. MakeUnique($v$); // Communication Phase 1

3. Unicast($v$, $D^0$, $D^1$, $D^2$); // Communication Phase 2

4. $p_k$ (only): SendToAll($v$); // Communication Phase 3

5. $w :=$ value received from $p_k$;

6. if ($v = 2 \lor D^{2} \notin Z_A$) then $v := \min(1, w)$

7. elseif ($v \neq 2 \land P \setminus D^{v} \in Z_A$) then stop

8. fi;

9. od;

Persistence and consistence can be proven in a similar way as for Agreement 
Protocol 1 of Section 3. Moreover even unconditional consistence can be proven. 
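The decision a player takes at the end of one iteration of Agreement Protocol 2 (adopt the king's value, stop early, or simply continue) can be sketched as below. The sketch assumes that the sets $D^0$, $D^1$, $D^2$ have already been produced by Unicast and that `in_ZA` is a hypothetical membership oracle for $Z_A$; both the function name and the toy oracle are illustrative.

```python
def protocol2_decide(P, v, D, w, in_ZA):
    """Lines 6 and 7 of Agreement Protocol 2 for one player.

    P: set of all players; v: value held after Unicast;
    D: dict {0: D0, 1: D1, 2: D2} of accepted-sender sets; w: king's value.
    Returns (new_value, stop_flag).
    """
    P = frozenset(P)
    if v == 2 or not in_ZA(frozenset(D[2])):
        return min(1, w), False            # line 6: adopt the king's value
    if in_ZA(P - frozenset(D[v])):
        return v, True                     # line 7: stop early with value v
    return v, False                        # keep v and continue


if __name__ == "__main__":
    at_most_one = lambda A: len(A) <= 1    # toy Z_A (assumption)
    P = {1, 2, 3, 4}
    print(protocol2_decide(P, 0, {0: P, 1: set(), 2: set()}, 1, at_most_one))
```

The first branch corresponds to a player that has evidence of disagreement, the second to a player that is certain every correct player already prefers his value.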

Lemma 8 (Persistence). If, at the beginning of some for-loop, all correct
players prefer the same value $v \neq 2$, then they still do so at the end of
the loop.

Proof. According to the persistence of MakeUnique and Unicast,
$D_p^{v} \notin Z_A$, $D_p^{\bar v} \in Z_A$ and $D_p^{2} \in Z_A$ are satisfied
for every correct player $p$ and hence $v_p = v$ and the king's value is ignored
by $p$.

The following two lemmas (Lemma 9 and 10) are needed for the proof of the 
consistence property of Agreement Protocol 2. Lemma 10 assures that the Stop- 
Implication indeed holds — a fact which also the previous proofs for MakeUnique 
and Unicast rely on. 

Lemma 9. If a correct player $p$ ignores the king's value according to Line 6 of
the protocol, then every correct player prefers the same value $v_p$ before
Communication Phase 3.

Proof. Suppose that $p$ ignores the king since $v_p = v \neq 2$ (and hence
$D_p^{v} \notin Z_A$) and $D_p^{2} \in Z_A$ hold. $D_p^{v} \notin Z_A$ implies
that at least one correct player entered Unicast with value $v$ and hence every
correct player entered Unicast with value $v$ or 2.

For the sake of contradiction suppose that some correct player $q$ enters this
phase with some value $v_q \neq v$. Due to Lemma 6 all values by correct players
are accepted. Hence, for some set $A$ of actively corrupted players and some set
$F$ of fail-corrupted players, we can write the player set as
$P = D_p^0 \cup D_p^1 \cup D_p^2 \cup A \cup F$, which can be decomposed as

$$D_p^{2} \cup \underbrace{(D_p^{v} \setminus D_q^{v})}_{=:A_1} \cup \underbrace{(D_p^{v} \cap D_q^{v})}_{\subseteq D_q^{v}} \cup \underbrace{D_p^{\bar v}}_{=:A_2 \cup F_2} \cup A \cup F = P$$

where $A_1 \cup A_2 \subseteq A$ and $F_2 \subseteq F$. Hence we get
$D_p^{2} \in Z_A$ (since $p$ ignores the king), $D_q^{v} \in Z_A$ (since
$v_q \neq v$) and $(A, F) \in Z$ in contradiction to $Q(P, Z)$.
Lemma 10 (Stop-Implication). If a correct player stops early then every correct
player already prefers the same value $v$ and will do so during every subsequent
communication round.

Proof. Consider the first iteration of the for-loop in which some correct
players stop. Let $p$ be such a player and $v = v_p \neq 2$ be his preferred
value. Player $p$'s stopping implies that the king's value is ignored by $p$ and
hence every correct player $q$ prefers the same value $v_q = v$ by Lemma 9. Due
to Lemma 6, $D_q^{2} \subseteq P \setminus D_p^{v}$ holds and since
$P \setminus D_p^{v} \in Z_A$ by the stop condition for player $p$ we
immediately get $D_q^{2} \in Z_A$. Hence every correct player $q$ ignores the
king's value and $v_p = v_q = v$ at the end of the loop. By Lemma 8 agreement on
this value persists for every further communication round.



Lemma 11 (Consistence). At the end of any for-loop with a correct king all
correct players prefer the same value.

Proof. If any correct player has stopped so far then consistence follows by
Lemmas 8 and 10. Hence suppose that no correct player has stopped so far, and
consider some iteration of the for-loop with $p_k$ being correct. If all correct
players replace their values by $v := \min(1, w)$ we are done. Suppose now that
at least one correct player $p$ ignores the value sent by the king. Hence, by
Lemma 9, every correct player prefers the same value $v_p$ before Communication
Phase 3. Hence especially the king prefers $v_p$ and every correct player who
replaces his value replaces it with $v_p$.



Lemma 12. Let $C$ be the set of players that are actively or fail-corrupted.
Agreement Protocol 2 achieves agreement and all correct players terminate the
protocol after at most $|C| + 2$ iterations of the for-loop.

Proof. That Agreement Protocol 2 achieves agreement follows immediately from
Lemmas 8 and 11 and the fact that there is at least one iteration of the
for-loop with a correct king. It remains to show that, at the end of the first
loop which is entered by all correct players with the same value $v \neq 2$, all
correct players have stopped with value $v$. Suppose that all correct players
enter some loop with the same value $v$. Due to the persistence property of
MakeUnique they also enter Unicast with this value and hence, due to Lemma 7,
they still hold this value after Unicast and
$P \setminus D_p^{v} \in Z_A$. Hence every correct player stops according to
Line 7 of the protocol.



4.5 Optimizations 

Agreement Protocol 2 can be optimized in the following ways. 

I. Depending on the concrete adversary structure Z the for-loop does not nec- 
essarily have to run over all n possible kings since it is only required that at 
least one of the kings be correct. 
II. Every correct player may stop the protocol immediately after the loop in
which he plays the king because all correct players will start the next loop 
with his value. 

III. In order to save one communication round in each loop, Communication 
Phase 3 (i.e. king’s value distribution) can be integrated into Unicast by the 
king already computing his distribution value in advance after the SendToAll 
round of Unicast: 

1. $D^i := \{\, p_l \in P \mid B^l = i \,\}$ for $i \in \{0,1,2\}$;

2. $v := \begin{cases} 0, & \text{if } D^0 \notin Z_A \\ 1, & \text{if } D^1 \notin Z_A \\ 2, & \text{else.} \end{cases}$

This value v can be sent by the king pk already during the MakeUnique round 
of Unicast without harming the protocol’s correctness. In order to see this 
suppose the king to be correct. 

— If all correct players consider the king then they all prefer the same value 
at the end of this loop. The king is only considered if agreement did not 
hold at the beginning of the loop and hence it does not matter which 
value is sent by the king. 

— If at least one correct player ignores the king then for some v G {0, 1} 
Dp ^ Za holds for every correct player p by Lemma 9. But since Dp^ C 
Dp^ this is the value v = v that is sent by pk • 

Theorem 3. For any player set $P$ and adversary structure $Z$ satisfying
$Q(P, Z)$, Agreement Protocol 2 (including the optimizations of this section)
reaches agreement. Let $C$ be the set of players that actually misbehave in the
protocol (by failing or sending false values); then all correct players
terminate the protocol after at most $3(|C| + 2)$ communication rounds.

Proof. The theorem follows by Lemma 12 and Optimization III of this section. 



4.6 Comparison with Previous Results 

Agreement Protocol 2 (with optimizations) can be applied to the threshold model
in [GP92] as well as to the general active adversary model in [FM98].

The protocols of [GP92] for the threshold model with actively and
fail-corrupted players involve $5(t + 1)$ communication rounds in the worst
case. Provided that only some $c < b$ players are actually corrupted, only
$5(c + 2)$ communication rounds are needed (whereas early stopping is not proven
for $b < c < t$). Our improvement of these protocols is two-fold. First, the
worst-case round complexity is only $3(t + 1)$. Second, we achieve early
stopping independently of any additional constraint on the number $c$ of
actually corrupted players, i.e., provided that some $c < t$ players are
actually corrupted, the round complexity is at most $3(c + 2)$.

In the general adversary model of [FM98] with only active player corruption,
the tight bound for broadcast and agreement to be achievable is that no three
adversary sets $A_i \in Z_A$ ($i \in \{1,2,3\}$) cover the player set $P$. This
implies that there is at least one player set $S \notin Z_A$ of cardinality
$|S| \le \lceil n/3 \rceil$ since otherwise this condition would be violated.
According to Optimization I, it therefore suffices to define the for-loop over
the set $S$ since this set contains at least one correct player. Hence Agreement
Protocol 2 involves at most $3\lceil n/3 \rceil \le n + 2$ communication rounds.
Provided that $c$ players are actually corrupted, only
$\min(3(c + 2), 3\lceil n/3 \rceil)$ communication rounds are needed. In
contrast to these results, the protocols of [FM98] need $2n$ communication
rounds in order to achieve polynomial communication complexity.
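The round counts quoted in this comparison are easy to tabulate. The sketch below only encodes the formulas stated in the text; the function names are illustrative, and the figure for [GP92] applies only in the range of c for which that paper proves early stopping.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def rounds_protocol2(c: int) -> int:
    """Early-stopping bound 3(c + 2) of Agreement Protocol 2."""
    return 3 * (c + 2)


def rounds_gp92_early_stop(c: int) -> int:
    """Bound 5(c + 2) quoted for [GP92], valid only for small c."""
    return 5 * (c + 2)


def rounds_protocol2_general(n: int, c: int) -> int:
    """Bound min(3(c + 2), 3 * ceil(n / 3)) in the model of [FM98]."""
    return min(3 * (c + 2), 3 * ceil_div(n, 3))


def rounds_fm98(n: int) -> int:
    """Bound 2n quoted for the protocols of [FM98]."""
    return 2 * n


if __name__ == "__main__":
    n = 10
    for c in range(0, 4):
        print(c, rounds_protocol2_general(n, c), rounds_fm98(n))
```

The last assertion in the accompanying test confirms the inequality $3\lceil n/3 \rceil \le n + 2$ for a range of n.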
Abstract. In this paper we study the randomness complexity needed
to distributively perform k XOR computations in a t-private way using
constant-round protocols.

We show that cover-free families allow the recycling of random bits for
constant-round private protocols. More precisely, we show that after
a 1-round initialization phase during which random bits are distributed
among the players, it is possible to perform each of k XOR computations
using 2 rounds of communication. In each phase the random bits are used
according to a cover-free family, and this allows each random bit to be used
for more than one computation.

For $t = 2$, we design a protocol that uses $O(n \log k)$ random bits instead
of $O(nk)$ bits if no recycling is performed. More generally, if $t > 1$
then $O(kt^2 \log n)$ random bits are sufficient to accomplish this task, for
$t =$ … for constant $\epsilon > 0$.

1 Introduction 

Consider a set of $n$ players $\mathcal{P} = \{P_1, \ldots, P_n\}$, each
possessing a private input $x_i$, who wish to compute a certain function $f$ of
these variables. A private protocol allows the players to distributively compute
the value of the function $f(x_1, \ldots, x_n)$ in such a way that at the end of
the protocol each player knows the value of the function, but no player has more
information about the other players' inputs than what can be inferred from its
own input and from the value of the function. Obviously, some players can
collude in order to infer information about other players' inputs. A protocol is
said to be t-private if any coalition consisting of at most $t$ players cannot
learn additional information from the protocol execution.

This problem has been widely studied in the last years and several informa- 
tion theoretic and computationally secure protocols have been pro- 
posed ([1,4,6,7,12,14]) and relations among the basic ingredients of private mul- 
tiparty computation have been established ([5,8,11]). It is well known that ran- 
domness is an essential resource for private protocols and this has motivated 
the quantitative study of randomness in the context of private protocols. In [5], 
lower bounds on the amount of randomness for t-private protocols have been
given. Moreover, the efficiency of a distributed protocol critically depends on
the number of rounds of communication needed for its execution. Surprisingly,
most of the work done in this field actually focused on minimizing the
randomness and the message complexity of the protocols, designing algorithms
that run in a number of rounds that depends on the number of players $n$ or on
the security threshold $t$. For example, [11] show how to use small sample
spaces to reduce the number of random bits in private computations. However, the
algorithm proposed requires a non-constant number of rounds of communication.
Some of the papers exploring the case of constant-round private protocols are
[3,2]. In [3] constant-round protocols are shown for the case of computational
privacy. In [2] the authors show that for any boolean function $f$, there exists
an information-theoretic t-private protocol that runs in a constant number of
rounds using $\Omega(tn \log n)$ random bits. Notice that to run $k$ different
computations of the function $f$, $\Omega(ktn \log n)$ random bits are needed,
as opposed to the $O(kt^2 \log n)$ random bits used by our protocol (see below).
In [13], it is shown that it is possible to recycle the random bits used for a
1-private XOR computation to perform $O(n)$ 1-private XOR computations.

In this paper, we show that recycling random bits in constant-round private
computation is possible not only for the case of minimal privacy but even when
the more stringent requirement of t-privacy is imposed. Our proposed protocols
consist of $k + 1$ phases: an initialization phase in which some of the players
pick random bits and distribute them to all the other players, and $k$
computation phases. During the $i$-th computation phase, each player sends a
single one-bit message to a designated player for phase $i$, who will then
compute and announce the result of the computation. The choice of the random
bits to use is based on the construction of families of subsets with the
"cover-freeness" property.

For ease of presentation, we start by describing a simple 2-private k-phase
2-round protocol that uses $O(n\sqrt{n})$ random bits and 2 rounds of
communication per phase. We will then show how to improve the randomness
complexity of this protocol, achieving $O(n \log k)$. We show further how to use
the random bits generated during the initialization phase according to a
cover-free family, and present a t-private protocol that uses
$O(kt^2 \log n)$ random bits and still runs in 2 rounds per phase.

2 Preliminaries 

Let $\mathcal{P} = \{P_1, \ldots, P_n\}$ denote the set of players. Each $P_i$
holds a vector of $k$ private inputs
$x_i = (x_i^1, \ldots, x_i^k) \in \{0,1\}^k$. The players are connected by means
of a complete network of private channels; these are channels in which only the
sender can write and only the receiver can read the transmitted message. The
players have unbounded computational resources and they are honest, i.e., they
follow the protocol, but curious in the sense that at the end of the protocol
some of them can collude to infer information about the private inputs of other
players.
The players wish to compute the $k$ sums^ of their inputs, i.e.,

$$\mathrm{xor}(x_1^1, \ldots, x_n^1);\ \mathrm{xor}(x_1^2, \ldots, x_n^2);\ \ldots;\ \mathrm{xor}(x_1^k, \ldots, x_n^k).$$
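To fix ideas, the target functionality (ignoring privacy entirely) is just one XOR per phase over all players' input bits. A minimal sketch, where `inputs[i][j]` is assumed to hold the bit of player i for phase j and indices start at 0 rather than 1:

```python
from functools import reduce
from operator import xor


def phase_xors(inputs):
    """Return the list of k phase results.

    inputs is a list of n bit-vectors of equal length k; entry j of the
    result is the XOR of the j-th bits of all n players.
    """
    k = len(inputs[0])
    return [reduce(xor, (x[j] for x in inputs)) for j in range(k)]


if __name__ == "__main__":
    # Three players, three phases.
    print(phase_xors([[1, 0, 1], [1, 1, 0], [1, 1, 1]]))
```

A private protocol must let the players learn exactly this vector and nothing beyond what it implies.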

Notations. We denote random variables by capital letters, any value that a
random variable can take by the corresponding lower-case letter, and use the
writing x to denote a vector.

We denote by $x^j = (x_1^j, \ldots, x_n^j) \in \{0,1\}^n$ the vector of inputs
for phase $j$, and by $x = (x^1, \ldots, x^k) \in \{0,1\}^{kn}$ the vector of
all the inputs. Moreover, let us denote by
$f^{(k)}(x) = (f(x^1), \ldots, f(x^k))$ the vector of the values of the function
in all the phases, by $C_i^j$ the random variable describing the communication
string seen by $P_i$ during phase $j$ (including incoming and outgoing
messages), and by $C_i$ the communication seen during the execution of all the
phases of the protocol. Similarly, denote by $C_S^j$ (resp., $C_S$) the
communication string seen by a subset $S$ of players during phase $j$ (resp.,
the entire execution) of the protocol. Moreover, we denote by $R_i$ (resp.,
$R_S$) the random bits received (or distributed) by $P_i$ (resp., players in
$S$) during the entire execution of the protocol.

Definition 1 (Privacy). A k-phase n-player protocol for computing a function
$f$ is t-private if for any subset $S \subseteq \mathcal{P}$ of at most $t$
players, for any input vectors $x, y$ such that $f^{(k)}(x) = f^{(k)}(y)$ and
such that $x_i = y_i$ for each $i \in S$, for any communication string $c$ and
for any random string $r_S$,

$$\Pr[C_S = c \mid R_S = r_S, X = x] = \Pr[C_S = c \mid R_S = r_S, X = y]$$

where the probability distribution is taken over the random strings of the other
players.



Definition 2. A d-random protocol is a protocol such that, for every input
assignment $x$, the total number of random coins tossed by all the players in
every execution is at most $d$.

In a distributed protocol, a round could be defined as the sequence: The 
players receive messages, execute some local computation, send some messages. 

Definition 3. A k-phase r-round protocol is a protocol such that, for every
input assignment $x$ and for any sequence of random coins tossed by the players,
the number of rounds to compute the value of some function $f$ in each phase is
at most $r$.

Let $a$ and $b$ be two $l$-dimensional vectors over some finite field; we will
denote by $a \cdot b = \sum_{i=1}^{l} a_i b_i$ the inner product of the two
vectors.

A 0-1 vector $v$ of length $l$ can be thought of as the characteristic vector of
an ordered set $(P_1, \ldots, P_l)$. With a slight abuse of notation, we say
that $P_i \in v$ iff the $i$-th component in $v$ is equal to 1.
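Both conventions can be illustrated with a few lines of Python. The field is assumed to be GF(2) here, which matches the paper's statement that all sums are bitwise XOR; the function names and player labels are illustrative.

```python
def inner_gf2(a, b):
    """Inner product of two 0-1 vectors of equal length over GF(2)."""
    return sum(x & y for x, y in zip(a, b)) % 2


def members(v, ordered_players):
    """Players P_i with P_i 'in' v, i.e. whose component of v equals 1."""
    return [p for p, bit in zip(ordered_players, v) if bit == 1]


if __name__ == "__main__":
    print(inner_gf2([1, 0, 1], [1, 1, 1]))
    print(members([1, 0, 1], ["P1", "P2", "P3"]))
```

Read this way, the inner product of a bit vector with a characteristic vector is the XOR of the bits held at the selected positions.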



^ All summations in this work are intended to be bitwise XOR.



Randomness Recycling in Constant-Round Private Computations 






3 The Protocols 

In this section we start by describing the protocol for t = 2 and then we present 
the general protocol for t > 1. 



3.1 A 2-Private O(n)-Phase 2-Round Protocol

In this section, we present a simple protocol consisting of a 1-round
initialization phase and k 2-round computation phases to perform $k = O(n)$ XOR
evaluations. The protocol uses $O(n\sqrt{n})$ random bits, improving on the protocol
that uses $O(n^2)$ random bits and $O(1)$ rounds of communication, and consists of k
independent executions of the protocol for computing the XOR.

Although we will later show a more efficient 2-private protocol that uses
$O(n \log n)$ random bits, and then a protocol for the case t > 2, the protocol we
present in this section contains all the main ideas of the more complex protocols
in a crisp and compact way.

The n players $P_1, \ldots, P_n$ are partitioned into $k = O(n)$ collecting
players $P_1, \ldots, P_k$, $d = O(\sqrt{n})$ distributing players $P_{k+1}, \ldots, P_{k+d}$, and
$n - d - k$ regular players. The distributing players are the only players that have
random bits. Each of them, during the initialization phase, gives one bit to each
other player (regular, collecting or distributing) and thus each player receives
exactly d random bits. Each computation is associated with one collecting player,
with each collecting player serving for exactly one phase.

Each player $P_i$ computes $k = d(d-1)/2$ bits $b_i^{(1)}, \ldots, b_i^{(k)}$ by performing
the xor of each possible pair of the random bits received during the initialization
phase. We also assume that the players have agreed upon a standard enumeration
of the pairs of random bits so that the j-th bits constructed by the players are the
xor of the random bits generated by the same two distributing players. Phase j
is thus associated with two distributing players and one collecting player, which we
call the phase managers.

Phase j can be described as follows: Each player $P_i$, excluding the managers,
computes $x_i^{(j)} + b_i^{(j)}$, i.e., the xor of his input for the j-th phase and the
random bit for this phase, and sends the result to the collecting player. Each
of the two distributing players $P_i$ of the phase computes the xor of his input,
the bit $b_i^{(j)}$, and all the random bits he has sent to the players during the
initialization phase. The collecting player $P_j$ computes the xor of all the messages
received with his input and the bit $b_j^{(j)}$ he computed, and announces the result to
the other players.

A more formal description of the protocol is found in Figure 1. It is easy
to see that the result is correct since all the random bits cancel out when the 
messages received by the collecting player are summed up. 
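The cancellation argument can be checked with a small simulation. The following is only a sketch under stated assumptions: players are indexed from 0 (collectors first, then distributors, then regular players), and the names `run_xor_protocol`, `xor_all` and `sent` are hypothetical helpers introduced for illustration, not part of the protocol description.

```python
import itertools
import random


def xor_all(bits):
    """Bitwise xor (sum over GF(2)) of an iterable of 0/1 values."""
    acc = 0
    for b in bits:
        acc ^= b
    return acc


def run_xor_protocol(inputs, d, rng):
    """Simulate the k phases; inputs[j][i] is the input bit of player i in phase j.

    Assumed indexing for this sketch: collectors are players 0..k-1,
    distributors are players k..k+d-1, the remaining players are regular.
    """
    k, n = len(inputs), len(inputs[0])
    pairs = list(itertools.combinations(range(d), 2))  # standard enumeration of pairs
    assert k <= len(pairs) and n >= k + d

    # Initialization: distributor h hands one fresh random bit to every player.
    sent = [[rng.randrange(2) for _ in range(n)] for _ in range(d)]

    results = []
    for j in range(k):
        a, b = pairs[j]                      # the two distributors managing phase j
        messages = []
        for i in range(n):
            mask = sent[a][i] ^ sent[b][i]   # the bit b_i^(j) of player i
            msg = inputs[j][i] ^ mask
            if i == k + a:                   # a manager also adds every bit it sent
                msg ^= xor_all(sent[a])
            elif i == k + b:
                msg ^= xor_all(sent[b])
            messages.append(msg)
        # Collector j xors all messages, its own masked input included.
        results.append(xor_all(messages))
    return results
```

Every random bit enters the final sum exactly twice (once inside a mask and once in the total added by the distributor that generated it), which is why the announced value equals the xor of the inputs.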

The 2-privacy of the protocol derives from the following easy observations. 
- Each coalition that does not include any collecting player has no information
on the inputs of the other players; 

- Each coalition consisting of two collecting players has no information on 
the inputs of the other players. Indeed, the messages of each player at two 
different phases have been masked with different pairs of random bits; 

- Each coalition of one collecting and one distributing player does not get 
information on the inputs of the other players since at each phase the inputs 
of the player have been masked using random bits generated by two different 
distributing players. 



Protocol XOR(n,k)

Initialization Phase

0. Choose d such that $\binom{d}{2} \ge k$.

1. Each distributing player $P_{k+i}$, $i = 1, 2, \ldots, d$, generates uniformly and
independently n random bits $r_1^{(i)}, \ldots, r_n^{(i)}$ and sends $r_j^{(i)}$ to player $P_j$.

2. Denote by $r_i = (r_i^{(1)}, \ldots, r_i^{(d)})$ the vector of random bits received by $P_i$.

3. Each player computes the sequence $\mathcal{F} = \{v_1, \ldots, v_k\}$ of characteristic vectors of
all the subsets of two elements of $\{1, 2, \ldots, d\}$ according to a standard
enumeration of pairs.

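Phase j = 1, 2, ..., k

1. Each distributing player $P_i \in v_j$ computes $m_i^{(j)} = x_i^{(j)} + r_i \cdot v_j + \sum_{l=1}^{n} r_l^{(i-k)}$.

2. Each distributing player $P_i \notin v_j$ computes $m_i^{(j)} = x_i^{(j)} + r_i \cdot v_j$.

3. Each collecting or regular player $P_i$ sends $m_i^{(j)} = x_i^{(j)} + r_i \cdot v_j$ to $P_j$.

4. $P_j$ computes the sum of the messages received and announces the result.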



Fig. 1. A simple 2-private k-phase 2-round protocol. 



We stress that each collecting player collects messages at most once during 
the entire execution. This is because if a player works as collector during two 
different phases then he could cancel out by himself the random bits used in 
both phases by simply xoring the messages received by each single player. Since 
it holds that the sets of distributing players and collecting players are disjoint, 
the number, n, of players involved in the execution must be at least $k + d$.

Theorem 1. Protocol XOR(n,k) is a 2-private k-phase 2-round $O(n\sqrt{n})$-random
n-player protocol to compute the XOR function, for any $k < n - O(\sqrt{n})$.

Proof. The correctness can be derived as follows. During phase j, collecting 
player Pj computes: 
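The computation can be sketched as follows, under notation assumed here for illustration: $a$ and $b$ denote the two distributing managers of phase $j$, and $r_l^{(a)}$ denotes the bit that distributing player $a$ handed to $P_l$ during the initialization phase.

```latex
\begin{align*}
\sum_{l=1}^{n} m_l^{(j)}
  &= \sum_{l=1}^{n} \bigl( x_l^{(j)} + r_l^{(a)} + r_l^{(b)} \bigr)
     + \sum_{l=1}^{n} r_l^{(a)} + \sum_{l=1}^{n} r_l^{(b)} \\
  &= \sum_{l=1}^{n} x_l^{(j)} .
\end{align*}
```

The second equality holds because every random bit appears exactly twice and all sums are bitwise xor.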
Let us consider the privacy condition. We will first consider the coalition S
composed of two collecting players $\{i_1, i_2\}$ and then we will extend the proof
to a coalition composed of a collecting player and another player (regular or
distributing) and, by what has been claimed above, this will conclude the proof.

W.l.o.g., let $S = \{1, 2\}$. Let x, y be two vectors such that $f^{(k)}(x) = f^{(k)}(y)$
and $x_i = y_i$ for each $i \in \{1, 2\}$, let $r_S = (r_1^{(1)}, \ldots, r_1^{(d)}, r_2^{(1)}, \ldots, r_2^{(d)})$ be
the random string assigned to players in S, let $B_1 = (M_1^3, \ldots, M_1^k)$, $B_2 =
(M_2^3, \ldots, M_2^k)$ and let $F^{(k)}$ be the random variable describing the values of the
function. The communication string seen by the coalition is a sequence $C_S =
(M_1^1, \ldots, M_n^1, M_1^2, \ldots, M_n^2, B_1, B_2, F^{(k)})$ of $2n + 2(n-2) + k$ messages, i.e., the
messages received by the players in the coalition during phases 1 and 2, the
messages sent by these players during the remaining phases and the values of the
function during the entire protocol. Fix an arbitrary communication string $c =
(c_1^1, \ldots, c_n^1, c_1^2, \ldots, c_n^2, b_1, b_2, f)$. Let $C_S^1 = (M_1^1, \ldots, M_n^1, M_1^2, \ldots, M_n^2, B_1, B_2)$
(resp. $c^1 = (c_1^1, \ldots, c_n^1, c_1^2, \ldots, c_n^2, b_1, b_2)$), i.e., the communication seen by the
coalition excluding the values of the function, then



$\Pr[C_S = c \mid X = x, R_S = r_S] = \Pr[C_S^1 = c^1, F^{(k)} = f \mid X = x, R_S = r_S]$

$= \Pr[F^{(k)} = f \mid X = x, R_S = r_S, C_S^1 = c^1] \cdot$ (1)
$\Pr[C_S^1 = c^1 \mid X = x, R_S = r_S].$ (2)

Let us consider separately the two factors. Since $f^{(k)}(x) = f^{(k)}(y)$,
the factor (1) is equal to

$\Pr[F^{(k)} = f \mid X = y, R_S = r_S, C_S^1 = c^1].$ (3)

On the other hand, (2) can be rewritten as:

$\Pr[C_S^1 = c^1 \mid X = x, R_S = r_S] =$

$= \Pr[M_1^1 = c_1^1, \ldots, M_n^1 = c_n^1, M_1^2 = c_1^2, \ldots, M_n^2 = c_n^2,
B_1 = b_1, B_2 = b_2 \mid X = x, R_S = r_S]$

$= \prod_{i=1}^{2} \Pr[M_i^1 = c_i^1, M_i^2 = c_i^2, B_i = b_i \mid X = x, R_S = r_S] \cdot$ (4)

$\prod_{i=3}^{n} \Pr[M_i^1 = c_i^1, M_i^2 = c_i^2 \mid X = x].$ (5)

This is because the messages sent by different players are independent since
their inputs and the random bits they receive are independent and the messages
sent by the players in $\mathcal{P} \setminus S$ are independent of the random bits received by
the players in S.

By hypothesis, $x_i = y_i$ for each $i \in S$, hence we compute (4) as

$\prod_{i=1}^{2} \Pr[M_i^1 = c_i^1, M_i^2 = c_i^2, B_i = b_i \mid X = y, R_S = r_S].$ (6)

For (5) recall that each message is the xor of the input and two random
bits. Fix one player $P_i \in \mathcal{P} \setminus S$. The messages that $P_i$ sends to $P_1$ and $P_2$
are $x_i^{(1)} + r_1 + r_2$ and $x_i^{(2)} + r_3 + r_4$, with $r_1, r_2, r_3, r_4$ taken
among the random bits $P_i$ has received during the initialization phase.

As $v_1, v_2$ are two (distinct) 2-subsets of $\{1, \ldots, d\}$, the set with characteristic
vector $v_1$ is not a proper subset of $v_2$ and vice versa. This implies that in $\{r_1, r_2\}$
there is at least one element that does not belong to $\{r_3, r_4\}$ and vice versa. As
a worst case example, suppose that $r_2$ and $r_4$ are the same element. This means
that $M_i^1 = x_i^{(1)} + R_1 + R_2$ and $M_i^2 = x_i^{(2)} + R_2 + R_3$.

If we fix the value of these messages to be $m_1, m_2$, for each possible input
sequence $x_i^{(1)}, x_i^{(2)}$, $P_i$ can, starting from his inputs along with
the value of $R_2 = r_2$, construct the messages $m_1, m_2$ by properly choosing the
random bits $r_1, r_3$. In other words the messages sent by the players in $\mathcal{P} \setminus S$ to
the ones in S are independent of the particular value of the inputs so we can
write the following:

$\prod_{i=3}^{n} \Pr[M_i^1 = c_i^1, M_i^2 = c_i^2 \mid X = x] = \prod_{i=3}^{n} \Pr[M_i^1 = c_i^1, M_i^2 = c_i^2 \mid X = y].$ (7)
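The worst case discussed above, in which the two masks share one random bit, can be verified exhaustively. The sketch below is an illustration only; the function name `message_distribution` is hypothetical and the three bits are assumed to be uniform and independent.

```python
from collections import Counter
from itertools import product


def message_distribution(x1, x2):
    """Distribution of (M1, M2) = (x1+R1+R2, x2+R2+R3) over uniform bits R1, R2, R3."""
    return Counter(
        (x1 ^ r1 ^ r2, x2 ^ r2 ^ r3)
        for r1, r2, r3 in product((0, 1), repeat=3)
    )


# One distribution per possible input pair (x1, x2).
distributions = [message_distribution(x1, x2) for x1, x2 in product((0, 1), repeat=2)]
```

All four distributions coincide (each message pair occurs for exactly two of the eight choices of random bits), so the pair of messages carries no information about the inputs.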

By grouping together (3), (6) and (7) we have:

$\Pr[C_S = c \mid x, r_S] = \Pr[C_S = c \mid y, r_S].$

To conclude, we have to prove that the theorem holds for a coalition composed
of a collecting player, w.l.o.g. $P_1$, and another non-collecting player $P_j$. Here we
have two possible cases: Either $P_j$ is a regular player or $P_j$ is a distributing one.
It suffices to notice that in both these cases $P_j$ does not receive any message, thus
the only messages the coalition can use are the ones received by $P_1$, i.e.,
$x_i^{(1)} + R_1 + R_2$.

Thus, if Pj is a regular player, then the coalition cannot cancel out any 
random bit, while if Pj is a distributing player, the coalition can cancel at most 
one random bit. Since each player uses at least two random bits during each 
phase, the privacy condition is still satisfied. 
Let us now consider the randomness complexity of the protocol. Recall that d
is the number of distributing players and n is the number of players involved
in the protocol. We have chosen as $\mathcal{F}$ the family of the characteristic vectors
of all the 2-subsets of $\{1, \ldots, d\}$, thus $|\mathcal{F}| = k = \binom{d}{2} = d(d-1)/2$. Since
$n \ge k + d = d(d-1)/2 + d = \Theta(d^2)$, we have that $d = O(\sqrt{n})$. Recall that each
distributing player selects n random bits, thus the total number of random bits
used is $n \cdot d = O(n\sqrt{n})$. □
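As a rough numerical illustration of this count, the sketch below computes the smallest number of distributing players whose pairs suffice for k phases, and the resulting number of random bits. The function names are hypothetical and chosen only for this example.

```python
def min_distributors_pairs(k):
    """Smallest d with d(d-1)/2 >= k, i.e. enough pairs to index k phases."""
    d = 2
    while d * (d - 1) // 2 < k:
        d += 1
    return d


def random_bits_pairs(n, k):
    """Total number n*d of random bits when phases are indexed by pairs."""
    return n * min_distributors_pairs(k)
```

For instance, with n = 100 players and k = 45 phases, d = 10 distributing players are enough, since 10 * 9 / 2 = 45.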

3.2 Improving Randomness Complexity for 2-Private Protocols 

In the previous section we have given a simple k-phase 2-round protocol that
allows reusing the same random bits for several computations. A better
randomness complexity can be obtained if we use more sophisticated combinatorial
structures. Indeed, notice that our proof of security of the previous section relies
on the property that the symmetric difference between any two vectors $v_i$ and $v_j$
has weight at least 2.

A Sperner family $\mathcal{F} = \{S_1, \ldots, S_{|\mathcal{F}|}\}$ on a ground set $G = \{g_1, \ldots, g_d\}$ is a
subset of $2^G$, with the property that $S_i \not\subseteq S_j$, for each $i \ne j$. In this case we say
that $S_j$ does not cover $S_i$. The size of the family $\mathcal{F}$ is the number of elements
in $\mathcal{F}$. A Sperner family is t-uniform if it contains only t-subsets of the ground
set G.

Notice that the set of vectors used by the protocol XOR presented in
Figure 1 is a 2-uniform Sperner family on ground set $\{P_{k+1}, \ldots, P_{k+d}\}$. In the
following, let $\mathcal{F}$ be a family of vectors on ground set $\{P_{k+1}, \ldots, P_{k+d}\}$ and $k \le |\mathcal{F}|$.
Each $v_i \in \mathcal{F}$ is the characteristic vector of a subset of the ground set. A
generalization of the protocol presented in Figure 1 is given in Figure 2.

The proof of the following lemma uses an argument similar to that of Theorem 1
and is thus omitted.

Lemma 1. Let $\mathcal{F}$ be a Sperner family on the ordered set $(P_{k+1}, \ldots, P_{k+d})$ of
size d such that for each $v_i \in \mathcal{F}$ it results that $|v_i| \ge 2$. Let $k \le |\mathcal{F}|$ such that
$n \ge k + d$. Protocol XOR(n, $\mathcal{F}$, k) is a 2-private k-phase 2-round n-player protocol
to compute the XOR function using $n \cdot d$ random bits.

By well known properties of the Sperner families, in order to maximize the
size of $\mathcal{F}$, we will use the d/2-uniform ones. Thus it is possible to write the
following:

Theorem 2. For each $\epsilon > 0$, all sufficiently large n, and $k = (1 - \epsilon)n$ there
exists a 2-private k-phase 2-round n-player protocol to compute the xor function
that uses $O(n \log k)$ random bits.

Sketch of Proof. It is well known that there exists a Sperner family over d
elements of size $\binom{d}{\lfloor d/2 \rfloor}$. Thus, we choose d as the minimum integer such that
$k \le \binom{d}{\lfloor d/2 \rfloor}$, i.e., $d = O(\log k)$, and construct such a Sperner family $\mathcal{F}$ to be used
in the protocol XOR(n, $\mathcal{F}$, k). □
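The choice of d in the proof sketch can be illustrated as follows. This is a sketch under assumptions: the search starts at d = 4 so that every set has weight at least 2, as Lemma 1 requires, and the names `min_ground_size` and `sperner_family` are hypothetical.

```python
from itertools import combinations, islice
from math import comb


def min_ground_size(k):
    """Smallest d >= 4 such that C(d, floor(d/2)) >= k."""
    d = 4  # start at 4 so that floor(d/2) >= 2, matching |v_i| >= 2 in Lemma 1
    while comb(d, d // 2) < k:
        d += 1
    return d


def sperner_family(k):
    """Return d and the first k subsets of size floor(d/2) of {0, ..., d-1}."""
    d = min_ground_size(k)
    family = [frozenset(s) for s in islice(combinations(range(d), d // 2), k)]
    return d, family
```

Since all sets in the family have the same size and are distinct, none of them contains another, which is exactly the Sperner property used by the protocol. For k = 1000 phases the sketch selects d = 13, compared with d = 46 when phases are indexed by pairs.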
Protocol XOR(n, $\mathcal{F}$, k)

Let $\mathcal{F} = \{v_1, v_2, \ldots, v_k\}$.

Initialization Phase

1. Each distributing player $P_{k+j}$, $1 \le j \le d$, generates uniformly and independently n
random bits $r_1^{(j)}, \ldots, r_n^{(j)}$ and sends $r_i^{(j)}$ to player $P_i$.

2. Denote by $r_i = (r_i^{(1)}, \ldots, r_i^{(d)})$ the vector of random bits received by $P_i$.

Phase j = 1, 2, ..., k

1. Each distributing player $P_i \in v_j$ computes $m_i^{(j)} = x_i^{(j)} + r_i \cdot v_j + \sum_{l=1}^{n} r_l^{(i-k)}$.

2. Each distributing player $P_i \notin v_j$ computes $m_i^{(j)} = x_i^{(j)} + r_i \cdot v_j$.

3. Each collecting or regular player $P_i$ sends $m_i^{(j)} = x_i^{(j)} + r_i \cdot v_j$ to $P_j$.

4. $P_j$ computes the sum of the messages received and announces the result.



Fig. 2. A Generic Constant-Round Protocol Using a family $\mathcal{F}$

4 A t-Private 2-Round Protocol 

In this section we focus our attention on the concept of a t-cover free family as
introduced in [10]. A t-cover free family $\mathcal{F} = \{S_1, \ldots, S_{|\mathcal{F}|}\}$ on the ground set G
is a subset of $2^G$, with the property that each set in the family is not covered by
the union of t others. We will denote by $\mathcal{F}_w$ a w-uniform t-cover free family. It
is easy to verify that the following holds:

Property 1. Let $\mathcal{F}_w$ be a w-uniform t-cover free family of size k over a ground
set of size $d < k$. Then, there exists a w-uniform t-cover free family $\mathcal{F}'_w$ of size
at least $k - d$ such that each element of the ground set belongs to at least 2 sets
of $\mathcal{F}'_w$.

The following property of t-cover free families will be used in proving the t- 
privacy of our protocol. 

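Lemma 2. Let $\mathcal{F}_w$ be a w-uniform t-cover free family over a ground set G,
with $w > t$. Then, for any $1 \le \ell \le t$ and for any $\ell$ sets $v_1, \ldots, v_\ell \in \mathcal{F}_w$
and $t - \ell$ sets $u_1, \ldots, u_{t-\ell} \subseteq G$ of size 1, for any other $v \in \mathcal{F}_w$ we have that
$v \not\subseteq \bigcup_{i=1}^{\ell} v_i \cup \bigcup_{i=1}^{t-\ell} u_i$.

Proof. Suppose that $v \subseteq \bigcup_{i=1}^{\ell} v_i \cup \bigcup_{i=1}^{t-\ell} u_i$ and choose $w_1, \ldots, w_{t-\ell} \in \mathcal{F}_w$ such
that $u_i \subseteq w_i$ and $w_i \ne v$. Such $w_i$'s certainly exist since, by Property 1, each
element of the ground set belongs to at least 2 sets of the family $\mathcal{F}_w$. Now we
have that $v \subseteq \bigcup_{i=1}^{\ell} v_i \cup \bigcup_{i=1}^{t-\ell} w_i$, which contradicts the t-cover freeness of the
family.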
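The t-cover free property itself is easy to test by brute force on small families. The checker below is a sketch for illustration only (exponential in t, with a hypothetical function name), not a construction used in the paper.

```python
from itertools import combinations


def is_cover_free(family, t):
    """True if no set of the family is contained in the union of t other sets."""
    sets = [frozenset(s) for s in family]
    for idx, target in enumerate(sets):
        others = sets[:idx] + sets[idx + 1:]
        # Unions only grow with more sets, so checking groups of size t suffices
        # (or all the other sets when fewer than t are available).
        for group in combinations(others, min(t, len(others))):
            if target <= frozenset().union(*group):
                return False
    return True
```

For example, the family of all 2-subsets of a 4-element ground set is 1-cover free (it is a Sperner family) but not 2-cover free, since {0,1} is covered by {0,2} and {1,3}.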

A sketch of the proof of the following lemma is given in the appendix. 
Lemma 3. Let $\mathcal{F}_w$ be a w-uniform (t − 1)-cover free family on a ground set
of size d such that $w > t$. For any $k \le |\mathcal{F}_w|$ such that $n \ge k + d$, Protocol
XOR(n, $\mathcal{F}_w$, k) is a t-private k-phase 2-round n-player protocol to compute the
XOR function using $n \cdot d$ random bits.

In order to evaluate the randomness complexity of the protocol presented we 
recall the following: 



Lemma 4 ([9]). There is a deterministic algorithm that, for any fixed t, k,
constructs a w-uniform t-cover-free family of size k with $w = [d/4t]$ on a
ground set of size d with

$d \le 16 t^2 \left( \frac{\log(k/2)}{\log 3} \right)$

Note that it is possible to construct a t-cover free family $\mathcal{F}_w$ with $w > t$.
By Lemmas 3 and 4, it is possible to write the following:



Lemma 5. For any n, t and for any k such that $n \ge k + t^2 \log k$, there exists an
explicitly constructible t-private k-phase 2-round n-player protocol to compute
the XOR function that uses $O(n t^2 \log k)$ random bits.



Thus we have the following:



Theorem 3. For t = and sufficiently large n, there exists a t-private
k-phase 2-round n-player protocol that uses $O(n t^2 \log k)$ random bits if $k \le n/2$ and
$O(k t^2 \log n)$ random bits if $k > n/2$.

Sketch of Proof. Lemma 5 actually proves the theorem for $k \le n/2$. Let $s = n/2$.
If $k > n/2$, we can run $O(k/s)$ independent executions of the t-private s-phase
protocol using $O(k t^2 \log n)$ random bits. □
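The count for the case $k > n/2$ can be checked with a short calculation, assuming (as in Lemma 5) that a single s-phase execution with $s = n/2$ uses $O(n t^2 \log s)$ random bits:

```latex
\[
\left\lceil \frac{k}{s} \right\rceil \cdot O\bigl(n t^{2} \log s\bigr)
  = O\!\left( \frac{k}{n/2} \cdot n t^{2} \log \frac{n}{2} \right)
  = O\bigl(k t^{2} \log n\bigr).
\]
```

The first step uses $\lceil k/s \rceil < 2k/s$, which holds because $k > s$.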



5 Conclusion 



We have shown how to run k disjoint 2-private XOR computations using only
$O(n \log k)$ random bits and 2 rounds per phase. One important feature of our
algorithm is that the phases are run independently of each other, i.e., we do
not require all k inputs to be available at the same time. We have also shown
how to achieve, in a constant number of rounds per phase, t-privacy for the
k-phase setting. We have presented a t-private k-phase 2-round protocol that uses
$O(n t^2 \log k)$ random bits, improving on the $O(tkn)$ of the constant-round “naive”
solution.
Appendix

Proof Sketch of Lemma 3. The proof of correctness uses an argument similar to
that of Theorem 1 and is thus omitted.

Consider an arbitrary coalition S of at most t players composed of $\ell$ collecting
players, $1 \le \ell \le t$, and $t - \ell$ distributing players. W.l.o.g., suppose that $S =
\{P_1, \ldots, P_\ell, P_{k+1}, \ldots, P_{k+t-\ell}\}$. The view of the coalition can be written as a
sequence of $n - |S|$ systems of linear equations in the unknowns



$m_i^{(1)} = x_i^{(1)} + r_i \cdot v_1$
$\vdots$
$m_i^{(\ell)} = x_i^{(\ell)} + r_i \cdot v_\ell$ (8)



Since the random bits distributed to different players are independent, we can
consider the systems one at a time. (Actually, this is not completely accurate
as we need to impose the condition that the sum of the inputs of each phase
equals the sum of the inputs for that phase. However this detail can be easily
accommodated.) By Lemma 2, for each $1 \le j \le \ell$, there exists a random bit r that
appears only in the j-th equality. Therefore we can write (8) as:
This shows that the view of the coalition consisting of
is independent of the vector of the inputs.
Abstract. In the contract-signing problem, participants wish to sign a
contract m in such a way that either all participants obtain each others’ 
signatures, or nobody does. A contract-signing protocol is optimistic if 
it relies on a trusted third party, but only uses it when participants 
misbehave (e.g., try to cheat, or simply crash). 

We construct an efficient general multi-party optimistic contract-signing
protocol. The protocol is also abuse-free, meaning that at no point can
a participant prove to others that he is capable of choosing whether to
validate or invalidate the contract. This is the first abuse-free optimistic
contract-signing protocol that has been developed for n > 3 parties. We
also show a linear lower bound on the number of rounds of any n-party 
optimistic contract-signing protocol. 



1 Introduction 

A contract is a non-repudiable agreement on a given text [7]. Contract signing
is an important part of any business transaction, in particular in settings where 
participants do not trust each other to some extent already. Thus, the World 
Wide Web is probably the best example of a setting where contract signing 
schemes are needed. Still, even though a great amount of research has gone into 
developing methods for contract signing over a network such as the WWW, the 
case of contract signing by an arbitrary number of signatories was only recently 
considered, with most of the work concentrating on the case of two parties. In 
this paper we present a general n-party optimistic contract-signing protocol. 

Electronic contract signing. Contract signing is part of the broader problem of 
fair exchange [8,11,17,21,2]. More specifically, it can be considered fair exchange 
of digital signatures [4]. The term “contract signing” was first introduced in [7]. 

Early work on electronic contract signing, or more generally, fair exchange
of secrets/signatures, focused on the gradual release of secrets to obtain
simultaneity [6,15,19] (see [12] for more recent results). The idea is that if each
party alternately releases a small portion of the secret, then neither party has a
considerable advantage over the other. Unfortunately, such a solution has several
drawbacks. Apart from being expensive in terms of computation and
communication, it has the problem in real situations of uncertain termination: if the
protocol stops prematurely and one of the participants does not receive a message,
he will never be sure whether the other party is continuing with the protocol, or
stopped, and perhaps even engaged in another contract-signing protocol!

More recently, considerable efforts have been devoted to develop efficient 
protocols that mimic the features of “paper contract signing,” especially fairness. 
A contract-signing protocol is fair if at the end of the protocol, either both 
parties have valid signatures for a contract, or neither does. In some sense, this 
corresponds to the “simultaneity” property of traditional paper contract signing. 
That is, a paper contract is generally signed by both parties at the same place 
and at the same time, and thus is fair. 

An alternative approach to achieving fairness instead of relying on the grad- 
ual release of secrets has been to use a trusted third party (TTP). A TTP is 
essentially a judge that can be called in to handle disputes between contract 
signers. This was the case in the early work of [14], where the contract is only 
valid if a “center of cancellation” does not object. The TTP can be on-line in 
the sense of mediating after every exchange as in [11,13,17], at the expense of 
the TTP becoming a potential bottleneck, or off-line, meaning that it only gets 
involved when something goes wrong (e.g., a participant attempts to cheat, or 
simply crashes, or the communication delays between the participants are
intolerably high, etc.). The latter approach has been called optimistic [2], and fair
contract-signing protocols have been developed in this model [2,4,3]. Although 
the TTP is (by definition) trusted, it is, in some protocols, possible for the TTP 
to be accountable for his actions. That is, it can be determined if the TTP 
misbehaved. 

However, until recently, “fair” contract-signing protocols had the conceptual 
drawback of allowing a party (say, Alice) at some point to show to outside 
parties that she has a potentially signed contract (say, with Bob), as well as
control over whether the contract will be validated. (In some cases protocols
are even asymmetric, in the sense of not giving the other party (Bob) the same
ability.) Alice may take advantage of this feature by coercing another party, say
Carol, into another contract on the threat of signing the contract with Bob, and
then cancelling the contract with Bob. (Bob would probably not appreciate this.)
Another example of abuse would be in “We will beat your best deal” scenarios, 
in which Alice can take a validated price on some merchandise from vendor Val 
and present it to merchant Mary to achieve a better price, and later renege on 
the deal with Val. 

In [18] this type of attack was pointed out, and an efficient abuse-free opti- 
mistic contract-signing protocol was presented. Although not mentioned, a pro- 
tocol for fair exchange of digital signatures presented in [4] is also abuse-free; 
however, that protocol is inefficient in that it requires cut-and-choose techniques. 

In [23] the optimal efficiency of optimistic contract-signing protocols is con- 
sidered, and they give tight lower bounds on the message and round complexity 
of 2-party protocols. 
Most of the protocols previously developed are for two-party contract
signings. In [18] an intricate protocol for three parties is presented. Independently
of and concurrently with our work, the case of multi-party contract signing was
considered in [1,5]. In [1], an optimistic multi-party contract-signing protocol is given
for the case of synchronous networks, while in [5], an optimistic multi-party
contract-signing protocol is given for asynchronous networks; these protocols,
however, are not abuse-free. To our knowledge, there has been no proposal on
how to produce general n-party abuse-free contract signings.

Our results. We give a construction of an optimistic abuse-free contract-signing
protocol for an arbitrary number of participants n. The construction is modular,
requires $O(n^3)$ messages and $O(n^2)$ rounds of interaction (in an asynchronous
environment), and tolerates an arbitrary number of faulty (Byzantine) parties.
The protocol uses private contract signatures, introduced in [18]. We also show
a lower bound of n rounds for any optimistic contract-signing protocol (i.e., even
if it is not abuse-free).

Other related work. As pointed out in [23], contract signing bears some re- 
semblance to atomic commitment [25] and Byzantine agreement [22]. Contract 
signing is to be robust against malicious participants while providing a non- 
repudiable agreement; on the other hand, atomic commitment tolerates only 
benign failures (is unattainable otherwise) and does not require that partici- 
pants reach a decision (also, when they do, all decisions — including those of the 
faulty participants — must be consistent). The failure model of contract signing 
is similar to that of Byzantine agreement — although in the TTP version the 
trusted third party is assumed to remain uncorrupted. However, contract sign- 
ing achieves more than reaching agreement: all the participants should be able 
to prove it afterwards. 

Organization of the paper. The remainder of the paper is organized as follows. 
Section 2 describes the model and states the relevant definitions. Section 3
describes the cryptographic tools that we need for the design. The multi-party
abuse-free contract-signing protocol is presented in Section 4, and the proof of
correctness and security is given in Section 5. Section 6 presents the efficiency
analysis and the lower bound on the number of rounds. Section 7 offers some 
final remarks. 



2 Model and Definitions 

Our basic model is similar to that of [4]. We have a set of participants $S =
\{P_1, P_2, \ldots, P_n\}$, and a trusted third party T. Participants may be correct or
faulty (Byzantine). Formally, the participants and the trusted third party are
modelled by probabilistic interactive Turing machines. We assume all
participants have public/private keys which will be specified later. T acts like a server,
responding (atomically) to requests from the participants, with responses
defined later. We assume that communication between any participants and T is
over a private channel.

The network model we consider is the same as in [4]. Namely, it is an asyn- 
chronous communication model with no global clocks, where messages can be 
delayed arbitrarily, but with messages sent between correct participants and the 
trusted third party guaranteed to be delivered eventually. In general, we assume 
an adversary may schedule messages, and possibly insert its own messages into 
the network. 

The participants, $\{P_1, \ldots, P_n\}$, wish to “sign” a contract m.¹ By signing
a contract, we mean providing public, non-repudiable evidence that they have 
agreed to the contract. The type of evidence is either universally agreed upon, or 
is part of the contract m itself. For instance, valid evidence may include either 
valid signatures by all participants on m, or valid signatures by all participants 
on the concatenation of m and the public key of T, signed together by T. 

Obviously, in the asynchronous model of [16], an adversary may prevent a 
contract from being signed simply by delaying all messages between players. 
Thus in order to force contract-signing protocols to be non-trivial, we specify 
a completeness condition using a slightly restricted adversary. An optimistic 
contract-signing protocol is complete if the (slightly restricted) adversary can- 
not prevent a set of correct participants from obtaining a valid signature on a 
contract. This adversary has signing oracles that can be queried on any
message except m, can interact with T, and can arbitrarily schedule messages from
the participants to T. However, it cannot delay messages between the correct 
participants enough to cause any timeouts. 

An optimistic contract-signing protocol is fair if 

1. it is impossible for a set of corrupted participants, A, to obtain a valid 
contract without allowing the remaining set of participants, H, to also 
obtain a valid contract; 

2. once a correct participant obtains a cancellation message from the TTP T, 
it is impossible for any other participant to obtain a valid contract; and 

3. every correct participant is guaranteed to complete the protocol.

Effectively, condition 2 provides a persistency condition, stating that correct
participants cannot have their outcomes overturned by other participants. In 
previous two-party protocols, this condition was never explicitly discussed, al- 
though the condition was trivially satisfied by those protocols. With multi-party 
protocols, the condition is much more difficult to satisfy. 

We assume the corrupted participants have a signing oracle to obtain
signatures from the remaining participants H on any message except m. The corrupted
participants may also interact with T and the messages from H to T may be
arbitrarily scheduled.

¹ In general, each participant might need to sign different pieces of text. Extension
to this case is trivial. 
An optimistic contract-signing protocol is abuse-free if it is impossible for
any set of participants, A, at any point in the protocol to be able to prove to 
an outside party that they have the power to terminate (abort) or successfully 
complete the contract. 

We say an optimistic contract-signing protocol is secure if it is fair, complete,
and abuse-free. 



3 Cryptographic Tools 

3.1 Discrete Logs 

For a prime q we will be working in a group $G_q$ of order q with generator g. For
a specific example, we could use a group formed by integer modular arithmetic as
follows. Let p be a prime of the form $lq + 1$ for some value l co-prime to q, and
let $Z_p^*$ be the group, with g an element of order q in $Z_p^*$. Then $G_q$ would be the
subgroup generated by g.
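A toy instance makes the construction concrete. The sketch below uses the deliberately tiny and insecure parameters p = 23 and q = 11 (so l = 2), chosen here only for illustration; the function names are hypothetical.

```python
def find_subgroup_generator(p, q):
    """Return g of order q in Z_p^*, assuming p and q are prime and q divides p - 1."""
    assert (p - 1) % q == 0
    cofactor = (p - 1) // q          # the value l in p = l*q + 1
    for h in range(2, p):
        g = pow(h, cofactor, p)      # h^l always lies in the subgroup of order q
        if g != 1:                   # any non-identity element of a prime-order group generates it
            return g
    raise ValueError("no generator found")


def subgroup_elements(p, q, g):
    """List the q elements g^0, ..., g^(q-1) of G_q."""
    return [pow(g, e, p) for e in range(q)]
```

With these parameters the search returns g = 4, and the powers of 4 modulo 23 form a subgroup of exactly 11 elements.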

We assume the hardness of the Decision Diffie-Hellman problem (DDH). In
this problem, it is the goal of the polynomial-time adversary to distinguish the
following two distributions with a non-negligible advantage over a random guess:

1. $(g_1, g_2, y_1, y_2)$ with $g_1, g_2, y_1, y_2 \in_R G_q$, and

2. $(g_1, g_2, g_1^r, g_2^r)$ with $g_1, g_2 \in_R G_q$ and $r \in_R Z_q$.

The security of the digital signatures described in the next section relies on the 
DDH assumption. 
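The two distributions can be sampled as in the following sketch, which assumes the toy group with p = 23, q = 11 and generator 4 (insecure parameters picked only for illustration). The brute-force test `is_dh_tuple` is feasible solely because the group is tiny; the DDH assumption says that no efficient test of this kind exists for suitably large groups. All names are hypothetical.

```python
import random

P, Q, G = 23, 11, 4   # toy parameters: 4 generates the order-11 subgroup of Z_23^*


def random_element(rng):
    """Uniform element of G_q, written as G^e for a uniform exponent e."""
    return pow(G, rng.randrange(Q), P)


def sample_random_tuple(rng):
    """Distribution 1: four independent uniform group elements."""
    return tuple(random_element(rng) for _ in range(4))


def sample_dh_tuple(rng):
    """Distribution 2: (g1, g2, g1^r, g2^r) for uniform g1, g2 and exponent r."""
    g1, g2 = random_element(rng), random_element(rng)
    r = rng.randrange(Q)
    return (g1, g2, pow(g1, r, P), pow(g2, r, P))


def is_dh_tuple(t):
    """Exhaustive search for a common exponent r; only viable in a toy group."""
    g1, g2, y1, y2 = t
    return any(pow(g1, r, P) == y1 and pow(g2, r, P) == y2 for r in range(Q))
```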



3.2 Private Contract Signatures 

Our protocols use private contract signatures (PCS), a type of digital signatures
that were introduced in [18]. Roughly, these signatures are:

- undeniable [9]: they are not self-authenticating or universally verifiable
(they cannot be transferred);

- designated-verifier [20]: only a selected receiver chosen by the signatory
can verify them and convince himself of their validity;

- designated-converter [10]: a third party (designated by the signatory) can
convert them into self-authenticating signatures; and

- non-interactive: the receiver of the signature knows beforehand that the
third party can convert the signature, without the need for a verification
session.

That is, upon receiving a PCS, a party convinces himself of its validity, but
cannot convince anybody else, and he also knows that the third party appointed
by the signatory can convert it into a regular (self-authenticating) signature.
For completeness, we include the full formal definition here. 
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Definition 1. A private contract signature (PCS) scheme U is a tuple of prob- 
abilistic polynomial-time algorithms {PCS-Sign, S-Convert, TP-Convert, PCS-Ver, 
S-Ver, TP-Ver} defined as follows, and having the security properties defined be- 
low. 



( 1 ) 



PCS-Sign executed by party A on m for B with respect to third party T, 
denoted PCS-Sign^(m, T); outputs a private contract signature, denoted 
PCSa(^, B,T). a private contract signature can be verified using PCS-Ver^ 



i.e.. 



PCS-Ver(m, A, B, T, 




true if S = FCSA{m,B,T); 
false otherwise. 



(2) S-Convert, executed by $A$ on a private contract signature $S = \mathrm{PCS}_A(m, B, T)$
generated by $A$, denoted $\text{S-Convert}_A(S)$, produces a universally-verifiable
signature by $A$ on $m$, $\text{S-Sig}_A(m)$.

(3) TP-Convert, executed by $T$ on a private contract signature $S = \mathrm{PCS}_A(m, B, T)$,
denoted $\text{TP-Convert}_T(S)$, produces a universally-verifiable signature by $A$
on $m$, $\text{TP-Sig}_A(m)$.



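(4) $\text{S-Sig}_A(m)$ can be verified using S-Ver, and $\text{TP-Sig}_A(m)$ can be verified
using TP-Ver; i.e.,

$$\text{S-Ver}(m, A, T, S) = \begin{cases} \text{true} & \text{if } S = \text{S-Sig}_A(m); \\ \text{false} & \text{otherwise;} \end{cases}$$

and

$$\text{TP-Ver}(m, A, T, S) = \begin{cases} \text{true} & \text{if } S = \text{TP-Sig}_A(m); \\ \text{false} & \text{otherwise.} \end{cases}$$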



The security properties of a PCS scheme are: 

(1) Unforgeability of $\mathrm{PCS}_A(m, B, T)$: For any $m$, it is infeasible for anyone
but $A$ or $B$ to produce $T$ and $S$ such that $\text{PCS-Ver}(m, A, B, T, S) = \text{true}$.

(2) Designated verifier property of $\mathrm{PCS}_A(m, B, T)$: For any $B$, there is
a polynomial-time algorithm FakeSign such that for any $m$, $A$, and $T$,
$\text{FakeSign}_B(m, A, T)$ outputs $S$, where $\text{PCS-Ver}(m, A, B, T, S) = \text{true}$.

(3) Unforgeability of $\text{S-Sig}_A(m)$ and $\text{TP-Sig}_A(m)$: For any $m, A, B, T$,
assuming $P = \mathrm{PCS}_A(m, B, T)$ is known and was produced by
$\text{PCS-Sign}_A(m, B, T)$, it is infeasible

(3.1) for anyone but $A$ or $T$ to produce $S$ such that $\text{S-Ver}(m, A, T, S) = \text{true}$
and $\text{TP-Ver}(m, A, T, S) = \text{true}$;

(3.2) for anyone but $A$ to produce $S$ such that $\text{S-Ver}(m, A, T, S) = \text{true}$
and $\text{TP-Ver}(m, A, T, S) = \text{false}$; and

(3.3) for anyone but $T$ to produce $S$ such that $\text{TP-Ver}(m, A, T, S) = \text{true}$
and $\text{S-Ver}(m, A, T, S) = \text{false}$.
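
The designated-verifier property can be illustrated with a deliberately simple analogy that is not the DDH-based construction of [18]. If A and B share a secret key, a MAC tag on (m, A, B, T) convinces B that A produced it (B knows he did not compute it himself), yet B could have computed the same tag, so it convinces nobody else. The sketch below assumes such a pre-shared key; the helper names `pcs_sign`, `pcs_ver`, and `fake_sign` are hypothetical, and conversion by A or T is not modeled at all.

```python
import hashlib
import hmac


def _tag(key: bytes, m: bytes, a: str, b: str, t: str) -> bytes:
    # Bind the tag to the message and to the identities of A, B and T.
    data = b"|".join([m, a.encode(), b.encode(), t.encode()])
    return hmac.new(key, data, hashlib.sha256).digest()


def pcs_sign(key_ab: bytes, m: bytes, a: str, b: str, t: str) -> bytes:
    """Toy stand-in for PCS-Sign_A(m, B, T); assumes A and B share key_ab."""
    return _tag(key_ab, m, a, b, t)


def pcs_ver(key_ab: bytes, m: bytes, a: str, b: str, t: str, s: bytes) -> bool:
    """Toy stand-in for PCS-Ver(m, A, B, T, S), run by the designated verifier B."""
    return hmac.compare_digest(s, _tag(key_ab, m, a, b, t))


def fake_sign(key_ab: bytes, m: bytes, a: str, b: str, t: str) -> bytes:
    """Toy FakeSign_B(m, A, T): B simulates a promise that verifies as A's."""
    return _tag(key_ab, m, a, b, t)
```

Because `fake_sign` and `pcs_sign` produce identical output, a transcript shown by B proves nothing to an outsider, which is the intuition behind property (2); a party without the key cannot produce a verifying tag, which mirrors property (1).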



Definition 2. A PCS scheme $\Sigma$ is third party-accountable if for any
$\mathrm{PCS}_A(m, B, T)$, the distributions of $\text{S-Sig}_A(m)$ and $\text{TP-Sig}_A(m)$,
produced by S-Convert and TP-Convert, respectively, are disjoint and efficiently
distinguishable.

That is, it is impossible to have $\text{S-Ver}(m, A, T, S) = \text{TP-Ver}(m, A, T, S) = \text{true}$
for any $S$, and thus it is possible for a verifier to distinguish whether the
conversion was performed by the signatory or by the third party. This property might
be useful if the signatories are concerned about the third party’s trustworthiness.
A somewhat complementary property is the following.

Definition 3. A PCS scheme $\Sigma$ is third party-invisible if for any
$\mathrm{PCS}_A(m, B, T)$, the distributions of $\text{S-Sig}_A(m)$ and $\text{TP-Sig}_A(m)$,
produced by S-Convert and TP-Convert, respectively, are identical.

In other words, for any $S$, $\text{S-Ver}(m, A, T, S) = \text{TP-Ver}(m, A, T, S)$, meaning no
one can determine if a conversion was performed by the original signer or the
trustee. This property might be useful in some scenarios where signatories may
not want possible “bad publicity” associated with a contract needing to be
converted by the third party.

In [18] it is shown: 

Theorem 1 ([18]). Assuming the security of DDH, there exists a PCS scheme
(in the random oracle model).

4 The Multi-party Contract-Signing Protocol 

4.1 Alternating Contract-Signing Protocols 

The basic structure of an optimistic 2-party protocol (e.g., [18]) is shown in
Figure 1. (The protocol of [4] has the same structure, except that the underlying
primitives are different, and provide different functionality.) The protocol has
an alternating structure, with a previously agreed-upon initiator, say, party A.
If no problems occur, A sends B a “promise” (i.e., a private contract signature
of Section 3) that she will sign the message. B verifies A’s PCS and becomes
convinced of A’s intentions, although he cannot convince anybody else, and responds
with his PCS. Then A converts her PCS into a universally-verifiable signature,
and then B does the same.

If problems occur, A or B may run protocols to abort or resolve the contract 
signing, depending on what step has been attained in the protocol. These proto- 
cols are run between A or B and T, the trusted party. Since A sends her message 
first and might not hear from B, or the PCS she receives could be wrong, the 
purpose of abort is to give A the power to terminate the protocol. If resolve is in- 
voked, the participant converts his own PCS into a regular signature, and sends 
it to T together with what he received from the other participant. Depending 
on the state of the protocol that it keeps, T grants an abort or converts the 
PCS into a regular signature and sends it back to the participant who invoked 
the resolve protocol. (T is able to perform the signature conversions given its 
designated-converter status.) It is shown in [18] that, assuming the security of 
DDH, the protocol of Fig. 1 is a secure contract-signing protocol for two parties. 

Notice that only A can abort the protocol. The reason for this is that if B
could also abort, then the abuse-freeness condition would be violated.
Specifically, after receiving A’s converted signature, $\text{S-Sig}_A(m)$, B could prove
to an outside party that he has a valid contract signed by A, as well as the power
to terminate it. There is also a reason for the protocol being alternating (first A
sends a message, then B does, and so forth), for if it were not (i.e., both parties
could start “simultaneously”), then the same type of scenario as the one above
could occur, with one party being able to prove that he has a valid contract.



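[Figure 1: message exchange between Signatory A and Signatory B, beginning with
$\mathrm{PCS}_A(m, B, T)$ sent from A to B. If the reply is missing or invalid, A runs
abort; at later steps the waiting party runs resolve.]

Fig. 1. A secure contract-signing protocol for two parties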



It was pointed out in [18] that this type of protocol does not extend easily to
the multi-party case; specifically, it is unclear whether to let the “middle” parties
abort or not. In the next section we present a contract-signing protocol for $n \ge 3$
parties that is secure. The approach we take is to let aborts be overturned, but in
a way that avoids the violations of fairness and abuse freeness pointed out in [18].
Naturally, we use the private contract signatures of Section 3 as “promises.”



4.2 Protocol Description 

Our protocol can be described recursively. For $i$ participants $P_i$ through $P_1$ to
sign a contract, $P_i$ indicates its willingness to sign the contract to $P_{i-1}$ through
$P_1$, then participants $P_{i-1}$ through $P_1$ come to an agreement about the
contract (with promises, not signatures), letting $P_i$ know about this (again with
promises). Then all participants exchange promises to sign the contract. Only
when all recursive levels are finished do the participants exchange signatures on
a contract.
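
The recursive structure can be made concrete by listing the phases in order. The following is a minimal sketch under stated assumptions: the function names `party_range` and `phases` are hypothetical, the phase wording is a paraphrase, and the base case (a single participant has nothing to exchange) is an assumption rather than something the description states.

```python
def party_range(hi: int) -> str:
    """Name the parties P_hi down to P_1."""
    return "P1" if hi == 1 else f"P{hi}..P1"


def phases(i: int) -> list[str]:
    """Ordered phases of the i-level protocol (base case i = 1 is an assumption)."""
    if i <= 1:
        return []  # a single party has nobody to exchange promises with
    lower = party_range(i - 1)
    return (
        [f"P{i} distributes 1-level promises to {lower}"]
        + phases(i - 1)  # the lower parties agree among themselves first
        + [
            f"P{i} collects {i - 1}-level promises from {lower}",
            f"{party_range(i)} exchange {i}-level promises",
        ]
    )


if __name__ == "__main__":
    for line in phases(3):
        print(line)
```

Each level adds three phases around the level below it, so the listing for i participants has 3(i - 1) entries; the final exchange of signatures after the top level is not included.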

To apply this recursive approach, we use different “strengths” of promises
for the different levels of recursion. We denote the strength of a promise by an
integer “level.” Specifically, an “$i$-level promise from $A$ to $B$ on a message $m$” is
denoted $\mathrm{PCS}_A((m, i), B, T)$. Our approach is shown in Figure 2. Note that the
order of participants is reversed ($P_i$ down to $P_1$) to match the recursion levels.

If there is no misbehavior by any participants, there is never a need to contact
the trusted third party T. If T is contacted, it needs to store some information
to handle the abort and/or resolve messages it may receive. Specifically, if there
is a positive resolution (meaning the contract must be honored) it needs to store
a completely signed contract (with the promises possibly converted to universal
signatures by T itself). If there has not been a positive resolution, T needs to
store an “abort list” containing indices of all the parties that have aborted, or
tried to resolve unsuccessfully. T will also store the first abort message it received,
along with another “forced abort list” that will be explained in the protocol below.



[Figure 2: columns for participants $P_i$, $P_{i-1}$, ..., $P_1$, with the phases, in
order: distribute 1-level promises; $(i-1)$-level protocol; collect $(i-1)$-level
promises; exchange $i$-level promises.]

Fig. 2. Secure contract-signing protocol: $i$th recursive level

Protocol for participant $P_i$.

(1) (Wait for all higher recursive levels to start) $P_i$ waits for 1-level
promises from $P_{i+1}, \ldots, P_n$ on $m$. If it does not receive them in a timely
manner, $P_i$ simply quits.

(2) (Start recursive level $i$) $P_i$ runs $\text{PCS-Sign}_{P_i}((m, 1), P_j, T)$ for
$1 \le j \le i-1$ and sends each $\mathrm{PCS}_{P_i}((m, 1), P_j, T)$ to $P_j$.

(3) (Wait for recursive level $i-1$ to finish) $P_i$ waits for $(i-1)$-level
promises from $P_1, \ldots, P_{i-1}$ on $m$. If it does not receive them in a timely
manner, $P_i$ runs the Abort procedure.

(4) (Send $i$-level promises to all parties) $P_i$ runs $\text{PCS-Sign}_{P_i}((m, i), P_j, T)$
for $1 \le j \le i-1$ and sends each $\mathrm{PCS}_{P_i}((m, i), P_j, T)$ to $P_j$.

(5) (Finish recursive level $i$ when $i$-level promises are received) $P_i$
waits for $i$-level promises from $P_1, \ldots, P_{i-1}$ on $m$. If it does not receive
them in a timely manner, $P_i$ runs the Resolve procedure.

(6) (Complete all higher recursive levels) For $a = i+1$ to $n$, $P_i$ does the
following:

(6.1) $P_i$ runs $\text{PCS-Sign}_{P_i}((m, a-1), P_a, T)$ and sends
$\mathrm{PCS}_{P_i}((m, a-1), P_a, T)$ to $P_a$.

(6.2) $P_i$ waits for $a$-level promises from $P_{i+1}, \ldots, P_a$ on $m$. If it does not
receive them in a timely manner, $P_i$ runs the Resolve procedure.

(6.3) $P_i$ runs $\text{PCS-Sign}_{P_i}((m, a), P_j, T)$ for $1 \le j \le i-1$ and sends each
$\mathrm{PCS}_{P_i}((m, a), P_j, T)$ to $P_j$.

(6.4) $P_i$ waits for $a$-level promises from $P_1, \ldots, P_{i-1}$ on $m$. If it does not
receive them in a timely manner, $P_i$ runs the Resolve procedure.

(6.5) $P_i$ runs $\text{PCS-Sign}_{P_i}((m, a), P_j, T)$ for $i+1 \le j \le a$ and sends each
$\mathrm{PCS}_{P_i}((m, a), P_j, T)$ to $P_j$.

(7) $P_i$ waits for signatures and $(n+1)$-level promises^ from $P_{i+1}, \ldots, P_n$ on $m$.
If it does not receive them in a timely manner, $P_i$ runs the Resolve procedure.

(8) $P_i$ runs $\text{PCS-Sign}_{P_i}((m, n+1), P_j, T)$ and
$\text{S-Convert}_{P_i}(\mathrm{PCS}_{P_i}((m, 1), P_j, T))$ for $1 \le j \le n$, and sends each
result $(\mathrm{PCS}_{P_i}((m, n+1), P_j, T),\ \text{S-Sig}_{P_i}((m, 1)))$ to $P_j$.

Protocol $P_i$-Abort: To abort, $P_i$ sends $[m, P_i, (P_1, \ldots, P_n), \text{abort}]_{P_i}$ to $T$.

Protocol $P_i$-Resolve: For $P_i$ to resolve, it runs
$\text{S-Convert}_{P_i}(\mathrm{PCS}_{P_i}((m, 1), P_j, T))$ (for any $j \in \{1, \ldots, n\}$) to produce
$\text{S-Sig}_{P_i}((m, 1))$, and then sends the message

$$[\{\mathrm{PCS}_{P_j}((m, k_j), P_i, T)\}_{j \in \{1, \ldots, n\} \setminus \{i\}},\ \text{S-Sig}_{P_i}((m, 1))]_{P_i}$$

to $T$, where for $j > i$, $k_j$ is the maximum level of a promise received from $P_j$
on $m$, and for $j < i$, $k_j$ is the maximum level of promises received from all
participants $P_{j'}$ with $j' < i$. (For example, if the maximum level of the promises
received by $P_4$ from $P_3$ and $P_2$ was 6, and the maximum level received by $P_4$
from $P_1$ was 5, then it would send the 5-level promises for $P_1$, $P_2$ and $P_3$.)
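
The choice of levels in the resolve message can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `resolve_levels` is hypothetical, `max_received[j]` is assumed to hold the highest promise level the resolving party has received from party j, and level 0 is used here to mean that no promise was received.

```python
def resolve_levels(i: int, n: int, max_received: dict[int, int]) -> dict[int, int]:
    """Return k_j for every j != i, following the rule for the resolve message.

    For j > i, k_j is the highest level received from P_j.
    For j < i, k_j is the highest level received from *all* lower-indexed
    parties, i.e. the minimum over j' < i of the highest level from P_j'.
    """
    lower = range(1, i)
    # Level that every lower-indexed party has reached (0 = nothing received).
    common = min((max_received.get(j, 0) for j in lower), default=0)
    levels = {}
    for j in range(1, n + 1):
        if j == i:
            continue  # a party holds no promise from itself
        levels[j] = common if j < i else max_received.get(j, 0)
    return levels


if __name__ == "__main__":
    # The example from the text: P4 got level 6 from P3 and P2, level 5 from P1.
    print(resolve_levels(4, 4, {3: 6, 2: 6, 1: 5}))
```

Taking the minimum over the lower-indexed parties is what makes P4 in the example fall back to the 5-level promises for all of P1, P2 and P3.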

Protocol for T: For each contract^ $m$ with participants $P_1$ through $P_n$, when T
learns about this contract (through an abort or resolve message) it sets up a
record that stores a boolean variable validated($m$) which is true if the contract
has been validated, and T has a full set of signatures (many would be converted
by T from promises, most likely). If validated($m$) is false, then T maintains the
set $S_m$ of indices of parties that have aborted, or resolved and failed to overturn
a previous abort. T also maintains a set $F_m$ of indices of parties that have
not themselves aborted, but whom T will force to abort in certain situations.
This set $F_m$ is needed because at a particular point in our protocol, standard
tests for whether to overturn an abort would allow unfairness. Specifically, they
would allow groups of participants to conspire and receive a valid contract after
a correct participant already received a cancellation from T.

^ We include these to simplify the TTP’s protocol. 

^ Without loss of generality, we may assume all contracts are distinct. 

The protocol when T receives an abort message from $P_i$,
$[m, P_i, (P_1, \ldots, P_n), \text{abort}]_{P_i}$, is as follows:

(1) If the signature is correct, and there has not been a positive resolution,
T does the following. If $i$ is larger than the maximum index in $S_m$, T
clears $F_m$. In any case, T stores $i$ in the set $S_m$. T sends
$[[m, P_j, (P_1, \ldots, P_n), \text{abort}]_{P_j}]_T$ and $[m, S_m, \text{abort}]_T$ to $P_i$. If this is
the first player in the list, T sets validated($m$) to false, and stores
$[[m, P_i, (P_1, \ldots, P_n), \text{abort}]_{P_i}]_T$.

(2) If there has been a positive resolution, T sends the valid (signed) contract
to $P_i$, i.e., $\{\text{S-Sig}_{P_j}((m, k_j))\}_{j \in \{1, \ldots, n\}}$, where $k_j$ is the level of the
promise from $P_j$ that was converted to a universally-verifiable signature.

The protocol when T receives a resolve message from $P_i$,

$$[\{\mathrm{PCS}_{P_j}((m, k_j), P_i, T)\}_{j \in \{1, \ldots, n\} \setminus \{i\}},\ \text{S-Sig}_{P_i}((m, 1))]_{P_i},$$

is as follows:

(1) If $i \in S_m$, T ignores the message.

(2) T checks to make sure all promises and signatures are consistent (i.e., $m$ is
the same, etc.) and valid. If not, then it ignores the message.

(3) If there has been no previous query to T on $m$, it sends

$$\{\text{TP-Convert}_T(\mathrm{PCS}_{P_j}((m, k_j), P_i, T))\}_{j \in \{1, \ldots, n\} \setminus \{i\}}$$

to $P_i$, stores all the signatures, and sets validated($m$) to true.

(4) If there has been a positive resolution, T sends the stored valid contract
to $P_i$, i.e., $\{\text{S-Sig}_{P_j}((m, k'_j))\}_{j \in \{1, \ldots, n\}}$, where $k'_j$ is the level of the
promise from $P_j$ that was converted to a universally-verifiable signature.

(5) If there has not been a positive resolution, T runs as follows:

(5.1) If $i \notin F_m$, then

(5.1.1) if for any $l \in S_m$ there is a $j \in S_m$ such that $j > k_l$,
then T adds $i$ to $S_m$, and sends the stored abort message
$[[m, P_j, (P_1, \ldots, P_n), \text{abort}]_{P_j}]_T$ and $[m, S_m, \text{abort}]_T$
to $P_i$. Let $a$ be the maximum value in $S_m$. If $a > i$, then
for all $j$ with $k_j = a - 1$, T adds $j$ to $F_m$. If $a = i$, T
clears $F_m$.

(5.1.2) Otherwise, T sends

$$\{\text{TP-Convert}_T(\mathrm{PCS}_{P_j}((m, k_j), P_i, T))\}_{j \in \{1, \ldots, n\} \setminus \{i\}}$$

to $P_i$, stores all the signatures, and sets validated($m$)
to true.

(5.2) Else ($i \in F_m$)

(5.2.1) Let $a$ be the maximum value in $S_m$. If for all $i < j < a$,
$k_j > a$ and for all $j < i$, $k_j > a$, then T sends

$$\{\text{TP-Convert}_T(\mathrm{PCS}_{P_j}((m, k_j), P_i, T))\}_{j \in \{1, \ldots, n\} \setminus \{i\}}$$

to $P_i$, stores all the signatures, and sets validated($m$)
to true.

(5.2.2) Otherwise, T adds $i$ to $S_m$, and sends the stored abort
message $[[m, P_j, (P_1, \ldots, P_n), \text{abort}]_{P_j}]_T$ and
$[m, S_m, \text{abort}]_T$ to $P_i$. Let $a$ be the maximum value in $S_m$. If
$a > i$, then for all $j$ with $k_j = a - 1$, T adds $j$ to $F_m$. If
$a = i$, T clears $F_m$.



5 Protocol Correctness 

Theorem 2. Assuming the security of DDH, the protocol above is a secure
optimistic contract-signing protocol (in the random oracle model), with
TTP-invisibility if the PCS scheme is third party-invisible, and TTP-accountability if
the PCS scheme is third party-accountable.



Proof. Recall that we say that an optimistic contract-signing protocol is secure 
if it is complete, fair and abuse-free. 

Complete: Completeness follows directly from the definition of private contract
signatures.

Fair: To show fairness for $P_i$ we must prove two results:

(1) If any other party or parties can obtain a valid contract on $m$, so can $P_i$,
and

(2) If $P_i$ gets a cancellation from T, then no other party or parties can
obtain a valid contract.

Result (1) can be seen from the following two cases:

(1) Say $P_i$ has sent out a universally-verifiable signature on $m$. Then $P_i$
has $(n+1)$-level promises from $P_{i+1}$ through $P_n$, and $n$-level promises
from $P_1$ through $P_{i-1}$, so $P_i$ can contact T and T will overturn any set
of aborts even if $i \in F_m$.

(2) Say $P_i$ has not sent out a universally-verifiable signature on $m$. Then
by the security of the PCS, for some party to obtain a valid contract,
some $P_j$ would have to obtain the contract from T. Then $P_i$ can send
an abort or resolve message to T and obtain the contract also.

Result (2) can be seen from the following cases:

(1) Say $P_i$ sends an abort to T in Step 3 of the protocol and receives a
signed abort message back from T. Since $P_i$ is correct, $i > 1$, parties
$P_1$ through $P_{i-1}$ have a 1-level promise from $P_i$, and parties $P_{i+1}$
through $P_n$ have no promise from $P_i$. Since T is honest, T stores $i$ in $S_m$
and will never allow any resolve request to overturn the abort, since
any resolve request will have $k_i$ (the level of $P_i$’s promise) at most 1, and
thus $i > k_i$.

(2) Say $P_i$ sends a resolve request to T in Step 5 of the protocol and
receives a signed abort message back from T. Since $P_i$ is correct, $P_i$
must have sent $(i-1)$-level promises to T from $P_1$ through $P_{i-1}$, and $P_1$
through $P_{i-1}$ have at most $i$-level promises from $P_i$. Since T is honest,
there must be a $j \in S_m$ with $j > i$, so T stores $i$ in $S_m$ and will
never allow any resolve request to overturn the abort, since any resolve
request will have $k_i$ (the level of $P_i$’s promise) at most $i$, and thus $j > k_i$.

(3) Say $P_i$ sends a resolve request to T in Step (6.2) of the protocol and
receives a signed abort message back from T. Since $P_i$ is correct, $P_i$ must
have sent $(a-1)$-level or $a$-level promises to T from $P_1$ through $P_{i-1}$
and $P_{i+1}$ through $P_{a-1}$, and $P_1$ through $P_{i-1}$ and $P_{i+1}$ through $P_a$
have at most $(a-1)$-level promises from $P_i$. Since T is honest, there
must be a $j \in S_m$ with $j > a$, so T stores $i$ in $S_m$ and will never allow
any resolve request to overturn the abort, since any resolve request will
have $k_i$ (the level of $P_i$’s promise) at most $a - 1$, and thus $j > k_i$.

(4) Say $P_i$ sends a resolve request to T in Step (6.4) of the protocol and
receives a signed abort message back from T. Since $P_i$ is correct, $P_i$
must have sent to T (1) $a$-level promises from $P_1$ through $P_{i-1}$ and (2)
$(a-1)$-level promises from $P_{i+1}$ through $P_{a-1}$. Also, $P_1$ through $P_{i-1}$
have at most $a$-level promises from $P_i$, and $P_{i+1}$ through $P_a$ have at
most $(a-1)$-level promises from $P_i$. Since T is honest, there are two
cases:

(4.1) There is a $j \in S_m$ with $j > a$, so T stores $i$ in $S_m$ and will
never allow any resolve request to overturn the abort, since
any resolve request will have $k_i$ (the level of $P_i$’s promise) at
most $a$, and thus $j > k_i$.

(4.2) There is no $j \in S_m$ with $j > a$, but $a \in S_m$, and there is also a
$j' \in S_m$ with $j' < i$. In this case, T stores $i$ in $S_m$ and all $j < i$
in $F_m$, and will never allow any resolve request to overturn the
abort, since any resolve request will either have $k_i$ at most $a - 1$
(and thus $a > k_i$), or will have $k_i = a$ and be from a party with
index in $F_m$.

Abuse-free: To show abuse-freeness for $P_i$ we must show that no party obtains
publicly verifiable information about (honest) $P_i$ signing the contract until $P_i$
has the power to validate the contract over any other parties’ aborts. This
follows by the designated verifier property of PCS, and the fact that once $P_i$
sends $\text{S-Sig}_{P_i}((m, 1))$, $P_i$ will have received $(n+1)$-level promises from $P_{i+1}$
through $P_n$ and $n$-level promises from $P_1$ through $P_{i-1}$, so $P_i$ either has all
the signatures from all other parties, or could contact T and obtain a valid
contract over any other parties’ aborts.

TTP-Invisibility or TTP-Accountability: Straightforward from the definitions.

□

6 Efficiency

By inspection of the recursive construction, it is easy to see that the number of
messages required by our protocol is $O(n^3)$, and the number of rounds is $O(n^2)$.

We now show a lower bound on the number of rounds which holds even if the
contract-signing protocol is not abuse-free.

Theorem 3. Any complete and fair optimistic contract-signing protocol with $n$
participants requires at least $n$ rounds in an optimistic run.

Proof. (Sketch) Since the optimistic protocol is complete, there must be a way for
the $n$ participants to obtain a universally-verifiable contract without contacting
the TTP. Thus there must be some party, say $P_1$, that sends a message during
some round containing information that can be used along with information
provided by other participants to validate the contract. At this point, since
the protocol is fair, $P_1$ must have received messages in previous rounds from
other participants, say $P_2$ through $P_n$, such that regardless of the actions of
those participants, $P_1$ could send a message to the TTP and obtain a validated
contract.

Specifically, there must be a previous round in which a participant, say $P_2$,
sends a message to $P_1$ that allows this. At this point, since the protocol is fair, $P_2$
must have received messages in previous rounds from participants $P_3$ through $P_n$
so that regardless of the actions of those participants $P_2$ could send a message
to the TTP and obtain a validated contract. (Otherwise, $P_2$ could not obtain a
validated contract, but $P_1$ would be able to (possibly much later), and thus the
protocol would not be fair for $P_2$.)

In a similar manner, given that a set of participants, say $P_1$ through $P_i$, have
received messages so that for each $j \in \{1, \ldots, i\}$, participant $P_j$ could send a
message to the TTP and obtain a validated contract regardless of the actions
of participants $P_{i+1}$ through $P_n$, there must be a previous round in which a
participant, say $P_{i+1}$, sends the message that allows this to $P_i$.

Therefore, by an inductive argument, we show the number of rounds is at
least $n$. □

7 Conclusions 

In this paper we presented an optimistic abuse-free contract-signing protocol
for an arbitrary number of participants $n$. The protocol tolerates an arbitrary
number of faulty (Byzantine) parties, and requires $O(n^3)$ messages and $O(n^2)$
rounds of interaction (in an asynchronous environment). Waidner [26] recently
suggested a way to add abuse freeness to any optimistic contract-signing protocol
using verifiable encryption. Applying this transformation to the protocol of [5]
yields an abuse-free optimistic contract-signing protocol requiring $O(n)$ rounds.

4 The indices here are unrelated to the indices used in the protocol of Section 4.

Abstract. Peterson’s n-process mutual exclusion algorithm [P81] has
been widely touted for elegance and simplicity. It has been analyzed
extensively, and yet certain properties have eluded the researchers. This
paper illustrates, and expands on, several properties of Peterson’s
algorithm: (1) We reiterate that the number of processes that can overtake a
process, called unfairness index, is unbounded in Peterson’s algorithm;
(2) With a slight modification of the algorithm, we obtain the
unfairness index of $n(n-1)/2$; (3) We identify an inherent characteristic of
that algorithm that sets the lower bound of $n(n-1)/2$ for the
unfairness index; (4) By modifying the characteristic, we obtain algorithms
with unfairness index $(n-1)$; (5) We show that the new algorithms are
amenable to reducing shared space requirement, and to improving time
efficiency (where the number of steps executed is proportional to the
current contention); and (6) We also extend the algorithms to solve the
$\ell$-exclusion problem in a simple and straightforward way.



1 Introduction 

We assume a system of n independent cyclic processes competing for a shared
resource R. In a process p, the part of the code segment that accesses R is
called a critical section (CS) of p for the resource R [D65]. The mutual exclusion
problem is to design an algorithm that assures the following properties:

— Safety : At any time, at most one process is allowed to be in the CS. 

— Liveness : When one or more processes have expressed their intentions to 
enter the CS, one of them eventually enters. 

In addition, it is desirable to have the following property. 

— Freedom from Starvation : Any process that expresses its intention to enter 
the CS will be able to do so in finite time. 

* This research is supported in part by the Natural Sciences and Engineering Research 
Council of Canada Individual Research Grant OGP0003182. 

The following assumptions are made in any mutual exclusion algorithm.

Assumption 1. The execution speed of any process is finite but unpredictable.

Assumption 2. Critical section execution time of any process is finite but 
unpredictable. 

The mutual exclusion problem is one of the important problems in concurrent
programming. Dekker was the first to give a correct software solution to the mutual
exclusion problem for the two-process case. Dijkstra proposed a solution for n
processes [D65]. Following Dijkstra’s solution, many solutions appeared in the
literature for n processes. Peterson presented an elegant solution for n processes
in 1981 [P81]. Ever since the publication of Peterson’s algorithm, it has been the
de facto example of a mutual exclusion algorithm for many researchers and most
textbooks on operating systems and concurrent programming, due to its
simplicity [P81, R86, BD93, S97, VFG97] and elegance [D81, H90, S97, VFG97, T98].
Many correctness proofs for this algorithm have been presented in the
literature [D81, R86, H90, VFG97].

That Peterson’s algorithm satisfies safety, liveness, and freedom from starva- 
tion properties can be proved easily. However there is some confusion about the 
“fairness” of Peterson’s algorithm. 

When several processes are competing for the CS, a process may overtake
another process, which started competing for the CS earlier, in entering the CS.
As an example, assume n seats (shared resource) are available for a concert
when a request r is submitted to reserve a seat. If the algorithm allows many
overtakes, then all the n seats may be taken by the requests which arrive
after r, and r is left unserved. This may occur due to different execution speeds
of the processes. Such overtakes are normally considered unfair from the user’s
point of view since the speed of a process may be constrained by the system’s
internal state and may not be an inherent property of any particular process.

We formalize the unfairness notion as follows. 

Definition 1. If a process i completes its first write operation on a shared vari- 
able of a mutual exclusion algorithm A before a process j does and subsequently j 
completes its CS execution before i does, then the process j overtakes the 
process i in CS access. 



Definition 2. In an algorithm A, the maximum number of possible CS access 
overtakes over a process by other processes is called the unfairness index of 
the algorithm A. 



Definition 3. In an algorithm A, the maximum number of CS access overtakes 
a process i can make over another process j is called the overtake index of 
the process i over j. The maximum of the overtake indices of i over other 
processes is called the overtake index of i in A. The maximum of the overtake 
indices of all the processes is called overtake index of the algorithm A. 
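
Definition 1 can be applied mechanically to a recorded execution. The sketch below is illustrative only: the trace format (a time-ordered list of `("write", pid)` and `("exit", pid)` events, marking the first shared write of an attempt and the end of the corresponding CS execution) and the function name `overtakes` are assumptions made here. It counts overtakes observed in one trace, whereas the unfairness index of Definition 2 is a maximum over all possible executions.

```python
def overtakes(trace):
    """Count, for each completed attempt, how many attempts overtook it.

    trace: time-ordered list of (event, pid) with event in {"write", "exit"}.
    Returns a list of (pid, overtaken_count), one entry per completed attempt,
    in the order in which the attempts left the critical section.
    """
    pending = {}   # pid -> time of the first write of its current attempt
    attempts = []  # (pid, write_time, exit_time)
    for t, (event, pid) in enumerate(trace):
        if event == "write":
            pending[pid] = t
        elif event == "exit":
            attempts.append((pid, pending.pop(pid), t))
    result = []
    for pid, w, e in attempts:
        # Another attempt overtakes this one if it wrote later but exited earlier.
        count = sum(
            1 for q, w2, e2 in attempts if q != pid and w < w2 and e2 < e
        )
        result.append((pid, count))
    return result


if __name__ == "__main__":
    trace = [("write", 1), ("write", 2), ("exit", 2),
             ("write", 2), ("exit", 2), ("exit", 1)]
    print(overtakes(trace))
```

In the sample trace, process 2 starts after process 1 and completes its CS twice before process 1 does, so process 1 is overtaken twice by the same process; this is the kind of repeated overtaking that the overtake index of Definition 3 measures.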

The unfairness index of an algorithm A indicates the worst possible amount
of denial of service that the algorithm may cause to any process, and the overtake
index indicates the best possible amount of favor of service that the algorithm
may give to any process.

Raynal [R86] calculates the unfairness index of Peterson’s algorithm as
$n(n-1)/2$. Kowaltowski and Palma [KP84] believe it to be $n - 1$ and Hofri [H90]
shows it to be $n - 1$ under certain liveness assumptions. However, it is widely
known that the unfairness index of Peterson’s algorithm is unbounded.
(Problem 10.12 in Lynch’s book [L96], page 329, asks for checking whether the
unfairness index is bounded, stated as whether bounded bypass is guaranteed.
Unbounded bypass of the n-process tournament algorithm, whose components are
Peterson’s algorithm for the case n = 2, is shown on page 293 in that book.) We
reiterate in this paper that the unfairness index is not bounded, by showing that
the overtake index is not bounded. We then show that by slightly modifying
Peterson’s algorithm a bounded unfairness index of $n(n-1)/2$, and overtake
index of $(n-1)$ can be obtained.

We identify an inherent characteristic of Peterson’s algorithm that sets the
lower bound for the unfairness index as $n(n-1)/2$. By changing this
characteristic, we obtain new mutual exclusion algorithms and achieve overtake index
of 1 (that is, no process can overtake any other process more than once) and
unfairness index of $(n-1)$.

The new algorithms suggest a way of reducing the shared space requirements. 
We give two space-efficient algorithms, one with the characteristic of Peterson’s 
algorithm and unfairness index n(n— 1)/2 and the other with a new characteristic 
and unfairness index (n — 1). These two algorithms bring out the difference 
between the two characteristics in a clear-cut manner. 

We also find ways of reducing the number of instructions a process has to 
execute before entering the CS. For our improved algorithms, this number is 
proportional to the contention for the CS. 

We also show that many of our algorithms can be extended in a very simple
manner to solve the $\ell$-exclusion problem [FLBB79, ADGMS94, AM94]. Here
up to $\ell$ ($\ell \ge 1$) processes can be in the critical section simultaneously. This may
be regarded as a resource allocation problem, with $\ell$ identical copies of a
non-sharable resource, where each process can request one copy of that resource. The
algorithms proposed in the literature either assume higher-level synchronization
primitives, test-and-set in [FLBB79], and atomic queue procedures, acquire and
release [AM94], or use a concurrent timestamp system [ADGMS94]. Designing a
concurrent timestamp system itself is a nontrivial problem [DS97]. In contrast,
our algorithms are simple and straightforward extensions of mutual exclusion
algorithms. It is noteworthy that all our new algorithms preserve the elegance
of Peterson’s algorithm.

In section 2, Peterson’s algorithm is given and its unfairness aspect discussed.
In section 3, new algorithms with unfairness index $n - 1$ are given. Space-efficient
algorithms are presented in section 4. Time-efficient algorithms are given in
section 5. Section 6 deals with $\ell$-exclusion algorithms, and section 7 presents the
concluding remarks.

2 Peterson’s Algorithm and Bounding its Unfairness 
Index 

The basic idea behind Peterson’s n-process mutual exclusion algorithm [P81] is
that each process passes through $n - 1$ stages before entering the critical section.
These stages are designed to block one process per stage so that after $n - 1$ stages
only one process will be eligible to enter the critical section (which we consider
as stage n). The algorithm uses two integer arrays step and pos of sizes $n - 1$
and $n$ respectively: pos is an array of 1-writer multi-reader variables and step is
an array of multi-writer multi-reader variables. The value at step[j] indicates the
latest process at step j, and pos[i] indicates the latest stage that the process i is
passing through. (Peterson uses Q for pos, and TURN for step.) The array pos
is initialized to 0. The process id’s, pids, are assumed to be integers between 1
and n. The code segment for process i is given in Figure 1^.



Process i:

1. for j := 1 to n − 1 do
2. begin
3.     pos[i] := j;
4.     step[j] := i;
5.     wait until (∀k ≠ i, pos[k] < j) ∨ (step[j] ≠ i)
6. end;
7. cs.i;
8. pos[i] := 0;

Figure 1: Algorithm PH1 [P81]



Process i:

1. for j := 1 to n − 1 do
2. begin
3.     pos[i] := j;
4.     step[j] := i;
5.     wait until (∀k ≠ i, pos[k] < j) ∨ (step[j] ≠ i)
6. end;
7. wait until (∀k ≠ i, (pos[k] = 0) ∨ (step[pos[k]] = k))
8. cs.i;
9. pos[i] := 0;

Figure 2: Algorithm PH2



We use the following terminology. 

— A process p is said to be at stage j if pos[p] = j. A process is said to be
blocked at stage j ($1 \le j \le n - 1$) if it is waiting for a condition to become
true at stage j. A process is blocked if it is blocked at some stage j; it is
unblocked otherwise.

— If the condition step[j] ≠ p is true (after the process p sets step[j] = p), then
we say that the process p has been pushed (out of stage j, to enter stage j + 1).

— A process p that has set step[j] = p and is not blocked at stage j is said to
have crossed stage j.

^ The labels of all the algorithms in this paper start with P, to indicate ‘Peterson-
style’. The meaning of the second letter will be indicated later on.


Lemma 1. The algorithm PH1 assures that at any time the number of processes
that have crossed stage j is at most $n - j$, for $1 \le j \le n - 1$, $n > 1$.

Proof. The proof is by induction on j. Let C = (∀k ≠ p, pos[k] < j) ∨ (step[j] ≠ p).

For j = 1, we need to consider only the case where all n processes are
competing for the CS and every process has set step[1]. The process, say p,
which sets its pid to step[1] last is blocked at stage 1, because the condition C,
for j = 1, is false for p. The assertion follows.

Assume as induction hypothesis that for some m, between 1 and $n - 2$,
the number of processes that have crossed stage m is at most $n - m$. Only these
processes will enter stage m + 1 and set step[m + 1]. The last process among them,
say p, that sets step[m + 1] is blocked, because the condition C, for j = m + 1,
is false for p. Therefore the number of processes that could cross stage m + 1 is
at most $n - (m + 1)$. The assertion follows. □



Theorem 1. The algorithm PH1 assures mutual exclusion.

Proof. By Lemma 1, at any time, the number of processes that can cross stage
$n - 1$, and enter the CS, is at most $n - (n - 1)$, that is, 1. □



Lemma 2. At any time, (a) the number of processes blocked at any stage is at 
most 1, and (b) at least one process with the highest pos value is unblocked. 

Proof. Part (a) follows from the fact that the only process that can be blocked at 
a stage k is the process whose pid is equal to step[k]. This fact also implies part 
(b) when there are two or more processes with the highest pos value. When there
is only one process, say p, with the highest pos value, say j, we have
$(\forall k \neq p,\ pos[k] < j)$ and therefore this process is not blocked. □



Theorem 2. The algorithm PH1 assures freedom from starvation, that is, a
competing process will enter the critical section in a finite time. 

Proof. We need to show only that no process will remain blocked at any
particular stage forever. Since a process may get blocked at most $n - 1$
times, the assertion would then follow.

Consider a process p blocked at some stage k. Suppose the condition
$step[j] \neq i$ never becomes true for p. Then, there must be a set Q of one or
more processes at stages higher than k. By Lemma 2(b), at any time, at least
one process with the highest pos value is unblocked (different processes in Q
may have the highest pos value at different times, and different processes may
be unblocked at different times). By Assumption 1, every unblocked process will
complete the execution of its current instruction in finite time and, by
Assumption 2, every process which enters the CS will complete the CS execution
in finite time. This guarantees that all processes in Q will eventually
complete their CS executions, and then the condition
$(\forall k \neq i,\ pos[k] < j)$ will be satisfied for p. Then, p will become
unblocked and can move up to the next stage. □

The unfairness index of the algorithm PH1 is not bounded if n > 2. The
following scenario illustrates this. The processes $p_1$, $p_2$, and $p_3$ are
currently competing for the CS.

- Process $p_1$ starts first and sets $pos[p_1] = 1$,

- $p_3$ starts and assigns its pid to step[1],

- $p_2$ sets its pid to step[1] and so $p_3$ is pushed,

- since the condition $(\forall k \neq p_2,\ pos[k] < 1) \lor (step[1] \neq p_2)$
  is not true, $p_2$ is blocked at stage 1,

- $p_3$ crosses stage 1 to stage 2,

- since the condition $(\forall k \neq p_3,\ pos[k] < j)$ is true for $j \ge 2$,
  it keeps proceeding further, enters the CS, and completes its CS execution,

- $p_3$ starts competing again for the CS, and sets its pid to step[1],

- the condition $(step[1] \neq p_2)$ becomes true, that is, $p_2$ is unblocked
  and $p_3$ is blocked at stage 1,

- $p_2$ moves up all the way, enters and leaves the CS, starts competing again,
  and sets its pid to step[1],

- this time $p_2$ gets blocked and $p_3$ is unblocked, at stage 1, and

- $p_2$ and $p_3$ can overtake $p_1$, alternately, several times until $p_1$
  sets $step[1] = p_1$.

This implies that the unfairness index of PH1 is not bounded.
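
The scenario can be replayed mechanically. The sketch below is illustrative only: it models each PH1 process as a Python generator, evaluates the wait condition of line 5 atomically for brevity (a coarser atomicity than the paper assumes), and uses a hypothetical helper `overtakes_while_p1_sleeps` that lets process 1 perform its first write and then schedules only processes 2 and 3, in strict alternation.

```python
def ph1(i, n, pos, step, entries):
    """Process i of PH1, competing forever; each yield ends one atomic step."""
    while True:
        for j in range(1, n):
            pos[i] = j
            yield
            step[j] = i
            yield
            # Wait condition of line 5, evaluated atomically in this sketch.
            while not (all(pos[k] < j for k in pos if k != i) or step[j] != i):
                yield
        entries[i] += 1                    # critical section
        yield
        pos[i] = 0
        yield

def overtakes_while_p1_sleeps(total_steps):
    """Count CS entries of processes 2 and 3 while process 1 sleeps at stage 1."""
    n = 3
    pos = {1: 0, 2: 0, 3: 0}
    step = {1: 0, 2: 0}
    entries = {1: 0, 2: 0, 3: 0}
    procs = {i: ph1(i, n, pos, step, entries) for i in (1, 2, 3)}
    next(procs[1])                         # process 1 sets pos[1] := 1, then sleeps
    for t in range(total_steps):
        next(procs[2 + t % 2])             # only processes 2 and 3 are scheduled
    return entries
```

The number of critical-section entries by processes 2 and 3 keeps growing with `total_steps` while process 1 never enters, which is the unbounded overtaking described above.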

Note: The above result holds even if we replace the “first write operation” by
“a bounded number of write operations” in Definition 1.1, as long as the bound
is less than $2(n - 2)$. Any number of overtakes on a process p is possible at
stage j, for every j up to $n - 2$. When p is in stage $n - 1$, that is, after
$2(n - 2)$ write operations, it can be overtaken by at most one process.

The unfairness index can be bounded by modifying the algorithm slightly.
One such modification is shown in PH2, in Figure 2. As seen from the above
scenario, the unboundedness is due to the fact that unblocked processes may
“sleep” for a long time or may execute very slowly compared to other processes.
Algorithm PH2 “speeds up” such slow processes; really, it “slows down” fast
processes until slower ones progress considerably: each process ready to enter
the CS waits until all other competing processes progress at least until the end
of their current stages. The proof of the safety property of PH2 is similar to
that of PH1.

Theorem 3. The algorithm PH2 has (a) unfairness index $n(n - 1)/2$ and (b)
overtake index $n - 1$.

Proof. In PH2, every process executes line 7 before entering the CS. There,
it waits until every other competing process is blocked (when it observes that 
process; later on, that process might have become unblocked and moved up). 
This assures that every unblocked process reaches at least the end of its current 
stage, and if it has already crossed the current stage and not yet entered the 
next stage, then it reaches at least the end of the next stage. 

Consider a process p at or about to enter stage j. By Lemma 1, at most
$n - (j - 1)$ processes can (cross stage $j - 1$ and) enter stage j. In the
worst case, all the other $n - j$ processes can overtake p in stage j. When the
first among these executes line 7, p would be blocked at stage j. It will remain
blocked at most until all these $n - j$ processes finish their CS executions. At
that time, p will notice that the condition $(\forall k \neq p,\ pos[k] < j)$ is
satisfied and will cross stage j.

Thus, at most $n - j$ processes can overtake p in stage j. Therefore the
maximum number of overtakes, that is, the unfairness index, is
$\sum_{j=1}^{n-1} (n - j) = n(n - 1)/2$.

Part (b) follows from the fact that a process q can overtake a process p at 
every stage. □ 
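
The bound in part (a) is the sum of the per-stage overtake limits over the stages 1 to n - 1. As a worked check (the substitution m = n - j is introduced here only for illustration):

```latex
\[
\sum_{j=1}^{n-1} (n - j) \;=\; \sum_{m=1}^{n-1} m \;=\; \frac{n(n-1)}{2},
\qquad\text{e.g. } n = 4:\quad 3 + 2 + 1 = 6 = \frac{4 \cdot 3}{2}.
\]
```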

3 Reducing Unfairness Index and Overtake Index 

The unfairness index of $n(n - 1)/2$ can be attributed to the following
characteristics of Peterson’s algorithm.

(1) A process crosses a stage by being pushed by another process. 

(2) Only the (single) process at the highest stage can also cross, by itself, without 
being pushed. 

Therefore, unless being pushed, a process p blocked at a stage is not allowed 
to cross that stage until all processes at higher stages finish the CS executions. 
Then, when p finally crosses, several new processes may overtake p. In the fol- 
lowing, we derive new algorithms by keeping (1) but modifying (2). We show 
that the new strategies help to reduce the unfairness index. We consider two 
alternatives to (2): 

(2a) Processes at all stages can also cross without being pushed, under certain
safe conditions.

(2b) Only the process at the lowest stage can also cross without being pushed,
under certain safe conditions.

The safe conditions of (2a) and (2b) are different. 

Algorithm PA1, described in Figure 3, implements (2a).² Here l(j) is the
number of processes whose pos value is less than j. A process is allowed to cross
stage j if there are at least j processes in stages 0 to $j - 1$. We note that
when there is no process in the CS, (2) is automatically satisfied for the single
process in the highest stage; it will observe $l(j) = n - 1$. Using the l(j)
notation, the algorithm PH1 can be stated as PH1', in Figure 4. (We note that
l(j) is simply the value computed by the current process by reading the pos
values. It is not a “global” value.)
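
To make the difference between the two wait conditions concrete, here is a small sketch of l(j) and of the crossing tests of PH1' and PA1. The function names are hypothetical, `pos` and `step` are plain dictionaries keyed by pid and by stage, and each test is evaluated on a snapshot rather than by reading the shared variables one at a time.

```python
def count_lower(j, pos):
    """l(j): number of processes whose pos value is less than j."""
    return sum(1 for value in pos.values() if value < j)

def may_cross_ph1_prime(i, j, pos, step):
    """Wait condition of PH1': (l(j) = n - 1) or (step[j] != i)."""
    n = len(pos)
    return count_lower(j, pos) == n - 1 or step[j] != i

def may_cross_pa1(i, j, pos, step):
    """Wait condition of PA1: (l(j) >= j) or (step[j] != i)."""
    return count_lower(j, pos) >= j or step[j] != i

# Four processes; 1 and 2 are at stage 1, and 2 wrote step[1] last.
pos = {1: 1, 2: 1, 3: 0, 4: 0}
step = {1: 2, 2: 0, 3: 0}
print(may_cross_ph1_prime(2, 1, pos, step))  # process 2 is not alone at the top
print(may_cross_pa1(2, 1, pos, step))        # two processes are below stage 1
```

In this snapshot process 2 must wait under PH1' but may cross under PA1, because l(1) = 2 is at least 1.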



Process i:
1. for j = 1 to n - 1 do
2. begin
3.     pos[i] := j;
4.     step[j] := i;
5.     wait until $(l(j) \ge j) \lor (step[j] \neq i)$
6. end;
7. pos[i] := 0;

Figure 3: Algorithm PA1



Process i:
1. for j = 1 to n - 1 do
2. begin
3.     pos[i] := j;
4.     step[j] := i;
5.     wait until $(l(j) = n - 1) \lor (step[j] \neq i)$
6. end;
7. pos[i] := 0;

Figure 4: Algorithm PH1'



Lemma 3. The algorithm PA1 assures that at any time the number of processes
that have crossed the stage j is at most $n - j$, for $1 \le j \le n - 1$,
$n > 1$.

The proof is the same as for Lemma 1 with
$C = (l(j) \ge j) \lor (step[j] \neq p)$.

Theorem 4. The algorithm PA1 assures mutual exclusion.

The proof follows from Lemma 3.

The unfairness index of the algorithm PA1 is also not bounded if n > 2. The
following scenario illustrates this. Suppose that processes $p_1$ and $p_2$ are
currently competing for the CS.

- Process $p_1$ starts first and sets $pos[p_1] = 1$,

- $p_2$ starts, sets $pos[p_2] = 1$ and $step[1] = p_2$,

- since $l(1) = n - 2 \ge 1$ for $p_2$, $p_2$ crosses stage 1,

- $l(j) = n - 1 \ge j$, for all j from 2 to $n - 1$, for $p_2$ and so it moves
  up all the stages, enters the CS, and leaves the CS,

- $p_2$ starts competing again, sets $pos[p_2] = 1$ and $step[1] = p_2$, and

- this cycle may repeat many times until $p_1$ sets $step[1] = p_1$.

Thus the number of overtakes of $p_2$ over $p_1$ is unbounded. Hence the
unfairness index of the algorithm PA1 is not bounded.

Here also the unboundedness is due to (unblocked) processes being too slow
relative to other competing processes. Again, by slowing down faster processes
until the slower ones progress considerably, we can bound the unfairness index.
One way of accomplishing this is given in PA2, in Figure 5. Here lp(j) is the
number of processes whose pos value is less than j but not zero. This algorithm
has unfairness index $n - 1$ and overtake index 1. The proofs are similar to
those for algorithm PL1 given in Figure 6.

² Our convention in labeling the algorithms is that the second letter indicates
whether the highest stage process (H), all processes (A), or the lowest stage
process (L) can cross by themselves without being pushed.



Process i:
1. for j = 1 to n - 1 do
2. begin
3.     pos[i] := j;
4.     step[j] := i;
5.     wait until $(l(j) \ge j) \lor (step[j] \neq i)$
6.     wait until $(lp(j) = 0) \lor (l(j) < j - 1)$
7. end;
8. cs.i;
9. pos[i] := 0;

Figure 5: Algorithm PA2



Process i:
1. for j = 1 to n - 1 do
2. begin
3.     pos[i] := j;
4.     step[j] := i;
5.     wait until $((lp(j) = 0) \land (Npo \ge j)) \lor (step[j] \neq i)$
6. end;
7. cs.i;
8. pos[i] := 0;

Figure 6: Algorithm PL1



Algorithm PL1 implements the characteristic (2b). Here Npo is the number
of processes whose pos value is 0 (again as observed by the current process).
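
As an illustration of characteristic (2b), the following sketch evaluates the PL1 wait condition on a snapshot of the shared variables. The function names are hypothetical, and `pos` and `step` are dictionaries keyed by pid and by stage.

```python
def lp(j, pos):
    """lp(j): number of processes whose pos value is less than j but not zero."""
    return sum(1 for value in pos.values() if 0 < value < j)

def npo(pos):
    """Npo: number of processes whose pos value is 0 (not competing)."""
    return sum(1 for value in pos.values() if value == 0)

def may_cross_pl1(i, j, pos, step):
    """Wait condition of PL1: ((lp(j) = 0) and (Npo >= j)) or (step[j] != i)."""
    return (lp(j, pos) == 0 and npo(pos) >= j) or step[j] != i

# Four processes: process 1 at stage 1, process 2 at stage 2, two idle.
pos = {1: 1, 2: 2, 3: 0, 4: 0}
step = {1: 1, 2: 2, 3: 0}
print(may_cross_pl1(1, 1, pos, step))  # lowest competing process, Npo = 2 >= 1
print(may_cross_pl1(2, 2, pos, step))  # process 1 is below it, and it is not pushed
```

Only the process at the lowest occupied stage can satisfy the first disjunct; every other process has to be pushed.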

Lemma 4. The algorithm PL1 assures that at any time the number of processes
that have crossed the stage j is at most $n - j$, for $1 \le j \le n - 1$,
$n > 1$.

The proof is the same as for Lemma 1 with
$C = ((lp(j) = 0) \land (Npo \ge j)) \lor (step[j] \neq p)$.

Theorem 5. The algorithm PL1 assures mutual exclusion.

The proof follows from Lemma 4.

We show that PL1 has unfairness index $n - 1$.

Theorem 6. The algorithm PL1 has overtake index 1.

Proof. Suppose a process p overtakes another process q in PL1. That is, q sets
pos[q] = 1 before p sets pos[p] = 1, but p enters the CS before q does. Then
the wait condition in line 5 for $j = n - 1$ can be satisfied for p only by p
being pushed, since it would observe at least two processes, namely p and q, to
be competing and so will compute $Npo \neq n - 1$.






When p is pushed off stage $n - 1$, let q be in stage m. If $m = n - 1$, then
since at most two competing processes could cross the stage $n - 2$, by Lemma 4,
q would have been the only process other than p in stage $n - 1$, and so q would
have pushed p to the CS, and would be blocked at stage $n - 1$. Now, suppose
$m < n - 1$. Then p would have crossed stages m + 1 and above only by being
pushed, since it will observe $lp(j) \neq 0$ due to q in a stage less than
m + 1. Then, (i) there must be a blocked process at every stage above m, (ii) the
number of processes at stages m or above can only be $n - (m - 1)$ including q,
by Lemma 4, (iii) the number of stages above m is $n - m$, and these imply that
q is the only process at stage m. Also, due to this number of processes being
$n - (m - 1)$, the last process that crossed stage m would have observed
$Npo \le m - 1$, and so would have crossed only by being pushed by the only
remaining process at stage m, namely q. That is, q is blocked at stage m.

So, after p completes CS execution, q will move up to the next stage, either
by itself or by being pushed. In any case, no other process can enter the stage
m + 1, until q pushes the process blocked in that stage, and gets itself blocked.
This pattern continues until q enters the critical section. That is, q cannot be
overtaken by any process, including p if it starts competing again. Hence the
overtake index of PL1 is 1. □



Theorem 7. The algorithm PL1 has unfairness index $n - 1$.

Proof. By Theorem 6, each process can overtake another process at most once.
Since the total number of processes in the system is n, at most $n - 1$
overtakes are possible before a process enters the CS. Hence the unfairness
index of PL1 is $n - 1$. □

4 Improving Space Efficiency 

It turns out that the wait condition of PL1 can be simplified to
$(Npo \ge j) \lor (step[j] \neq i)$, still yielding unfairness index $n - 1$.
The resulting algorithm would have the characteristic (2a) instead of (2b).

To compute Npo, we need to know only whether the pos value is zero or nonzero,
that is, whether a process p is competing or not. Therefore the algorithm can
be simplified with a boolean door value instead of the integer pos. The array
door is initialized to 0. The algorithm is PA3, in Figure 7.

It is interesting to note that an algorithm with characteristic (2) and
unfairness index $n(n - 1)/2$ can be obtained by a simple modification of the
wait condition. This is algorithm PH3, given in Figure 8.

Suppose m processes are competing. In both PH2 and PH3, a process p can
cross without being pushed only if all other processes are in the lower stages.
An additional requirement in PH3 is that p must be in stage m or above, whereas
in PH2, p could be anywhere, even in a stage below m. This additional
requirement in PH3 ensures that, when p moves up without being pushed, each
stage j, for j between 1 and $m - 1$, has exactly one process, and this process
will cross stage j after at most $n - j$ processes overtake it.

Process i:
1. door[i] := 1;
2. for j = 1 to n - 1 do
3. begin
4.     step[j] := i;
5.     wait until $(n - \sum_{k=1}^{n} door[k] \ge j) \lor (step[j] \neq i)$
6. end;
7. cs.i;
8. door[i] := 0;

Figure 7: Algorithm PA3

Process i:
1. door[i] := 1;
2. for j = 1 to n - 1 do
3. begin
4.     step[j] := i;
5.     wait until $(\sum_{k=1}^{n} door[k] \le j) \lor (step[j] \neq i)$
6. end;
7. cs.i;
8. door[i] := 0;

Figure 8: Algorithm PH3

The proofs of the properties for PA3 and PH3 are similar, respectively, to
those of PA2 and PH2.



5 Improving Time Efficiency 

The algorithms shown in Figures 7 and 8 are also amenable to improving
time efficiency. We observe that when the number of competing processes is less
than n, it is unnecessary for a process to start from stage 1. For example, if
there is only one process competing for the CS, then the process can ‘jump’
directly to the CS (to the stage n). In general, if there are k processes
competing for the CS, a process can jump directly to the stage $n - k + 1$, and
continue with the rest of the algorithm from there on. This is done with
algorithm PA3. The resulting algorithm would have characteristic (2b) instead of
(2a). This is algorithm PL2, given in Figure 9. (The same modification is
applicable to PH3 also.) Here, the number of iterations in the loop, and hence
the number of shared variable accesses, depends on the contention observed by
the process.

This modification in PL2 also assures additional fairness in the following 
sense: if a process p sets its door value after another process q made its initial 
jump, then p cannot overtake q. That is, the maximum number of overtakes 
depends only on the current contention. 
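
The effect of the initial jump on the number of stages visited can be sketched as follows. The helper names are hypothetical, `door` is modelled as a list of 0/1 flags that already includes the caller's own flag, and the sum is computed on a snapshot rather than by reading the flags one by one.

```python
def start_stage(n, door):
    """First stage visited in PL2: n - (number of competing processes) + 1."""
    return n - sum(door) + 1

def stages_to_cross(n, door):
    """Stages the process still has to cross, from its start stage up to n - 1."""
    return list(range(start_stage(n, door), n))

def may_cross_pa3(i, j, n, door, step):
    """Wait condition shared by PA3 and PL2: (n - sum(door) >= j) or (step[j] != i)."""
    return n - sum(door) >= j or step[j] != i

print(stages_to_cross(5, [1, 0, 0, 0, 0]))  # a lone process skips every stage
print(stages_to_cross(5, [1, 1, 1, 0, 0]))  # three competitors: stages 3 and 4
print(stages_to_cross(5, [1, 1, 1, 1, 1]))  # full contention: stages 1 to 4
```

With a single competitor the list is empty, so the process goes straight to the critical section, and with all n processes competing the loop degenerates to the n - 1 stages of PA3.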



Process i:
1. door[i] := 1;
2. for j = $n - \sum_{k=1}^{n} door[k] + 1$ to n - 1 do
3. begin
4.     step[j] := i;
5.     wait until $(n - \sum_{k=1}^{n} door[k] \ge j) \lor (step[j] \neq i)$
6. end;
7. cs.i;
8. door[i] := 0;

Figure 9: Algorithm PL2



6 Algorithms for l-exclusion Problem

This section deals with a generalization of the mutual exclusion problem, called
the l-exclusion problem, introduced by Fischer et al. in [FLBB79], and
subsequently studied in [ADGMS94, AM94]. The required properties of an
l-exclusion algorithm are:

- Safety (l-exclusion): At any time, at most l processes are allowed to be in
  the critical section simultaneously.

- Liveness (l-lockout avoidance): When there are strictly fewer than l processes
  in the critical section, at least one process from the group of processes
  waiting for the CS should reach the CS in a finite time.

Fischer et al. [FLBB79] discussed a way of reducing the l-exclusion problem to
the one-exclusion problem and then applying known solutions to the latter
problem. This solution is commonly used in banks for scheduling people waiting
for a teller. The main drawback of this approach is that, if $l \ge 2$ copies of
the resource are free, instead of allowing the first l processes to move
“simultaneously” to “tellers”, the algorithm requires them to file past to the
front of the line one at a time. If the process at the front of the line is
slow, then $l - 1$ processes behind it are forced to wait. In fact, if the
process at the front of the line “fails”, then the processes behind it wait
forever and the system stops functioning. In this case, one failure can tie up
all of the system’s resources [FLBB79].

As noted in [AM94], all prior l-exclusion algorithms for shared memory systems
either require unrealistic atomic operations or perform badly. We show
that many of the mutual exclusion algorithms in this paper can be modified
very slightly to obtain l-exclusion algorithms.

Except PH1 and PH2, all the algorithms presented in the previous sections
can be easily modified to solve the l-exclusion problem, simply by changing
$n - 1$ to $n - l$ in the ‘for statement’ of each algorithm. We present the
l-exclusion version of PL2 only, in Figure 10.

To extend Peterson’s algorithm, PH1, to solve the l-exclusion problem (problem
10.13 in Lynch’s book [L96]), the simple modification adopted for our
algorithms, that is, just changing the number of stages to be crossed, is not
appropriate. Instead the condition for a process to cross a stage without being
pushed has to be modified. In PH1', the modification would be
$l(j) \ge n - l$ instead of $l(j) = n - 1$, in the wait statement. In addition,
the above modification, that is, changing $n - 1$ to $n - l$, can also be done
to improve efficiency. The resulting algorithm is shown in Figure 11.
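
The two changes that turn PH1' into its l-exclusion version can be sketched on a snapshot of the shared variables. The names below are hypothetical; in particular the exclusion parameter l is written `ell` to keep it apart from the count l(j), which is called `count_lower` here.

```python
def count_lower(j, pos):
    """l(j): number of processes whose pos value is less than j."""
    return sum(1 for value in pos.values() if value < j)

def stages(n, ell):
    """Stages crossed in the l-exclusion versions: 1 .. n - ell."""
    return list(range(1, n - ell + 1))

def may_cross_l_ph1_prime(i, j, pos, step, ell):
    """Modified wait condition: (l(j) >= n - ell) or (step[j] != i)."""
    n = len(pos)
    return count_lower(j, pos) >= n - ell or step[j] != i

# Four processes; 1 and 2 compete at stage 1, and 2 wrote step[1] last.
pos = {1: 1, 2: 1, 3: 0, 4: 0}
step = {1: 2, 2: 0, 3: 0}
print(stages(4, 2))                                 # only two stages for ell = 2
print(may_cross_l_ph1_prime(2, 1, pos, step, 2))    # l(1) = 2 >= 4 - 2
print(may_cross_l_ph1_prime(2, 1, pos, step, 1))    # ell = 1 is the PH1' test
```

Setting `ell = 1` gives back the mutual exclusion condition l(j) = n - 1 (the count can never exceed n - 1 for a competing process), so the sketch covers both cases.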

Process i:
1. door[i] := 1;
2. for j = $n - \sum_{k=1}^{n} door[k] + 1$ to n - l do
3. begin
4.     step[j] := i;
5.     wait until $(n - \sum_{k=1}^{n} door[k] \ge j) \lor (step[j] \neq i)$
6. end;
7. cs.i;
8. door[i] := 0;

Figure 10: Algorithm l-PL2

Process i:
1. for j = 1 to n - l do
2. begin
3.     pos[i] := j;
4.     step[j] := i;
5.     wait until $(l(j) \ge n - l) \lor (step[j] \neq i)$
6. end;
7. cs.i;
8. pos[i] := 0;

Figure 11: Algorithm l-PH1'



The correctness of these algorithms follows from the correctness proofs of
their one-exclusion versions. In contrast to the algorithm of [FLBB79], our
l-exclusion algorithms can tolerate up to $l - 1$ failures and therefore can be
used to build fault-tolerant distributed systems.

7 Concluding Remarks 

Peterson’s n-process mutual exclusion algorithm has intrigued researchers since
its publication. It has been extensively analyzed, and yet certain properties
have eluded the researchers. This paper attempts to fill the gap in the
analysis. We have (i) shown the unboundedness of the unfairness index of the
algorithm and that the algorithm can easily be modified to bound the unfairness
index to $n(n - 1)/2$, and (ii) captured the underlying characteristic of the
algorithm that sets the bound to the above value. Then, employing complementary
characteristics, we have devised new mutual exclusion algorithms with unfairness
index $n - 1$. We have also given, through simple extensions, several elegant
algorithms that improve space and time efficiency, and that solve the
l-exclusion problem.

Though the first software solutions for the mutual exclusion problem were 
presented as early as 1965, new solutions keep emerging for this problem ei- 
ther to meet the needs of different technologies (such as multiprogramming, 
multitasking, multiprocessing, parallel processing, distributed processing, and 
multithreading), or to achieve newly conceived criteria (such as
self-stabilization, scalability, and adaptiveness), or to effectively exploit
some of the system
characteristics (such as sparse situation where the total number of potentially 
contending processes is very large but the number of processes expected to be 
competing at any time is very small, distributed shared memory machines where 
the access of each shared variable involves traversing of interconnection network, 
and thread synchronization (threads are created within a single process, normally 
they are not large in number, and mostly all are executed in the same
processor)), or to improve over existing ones with respect to certain
quantitative or qualitative measures. This diversity in the classes of mutual
exclusion algorithms makes a meaningful global comparison difficult.

Fairness is a crucial attribute of distributed algorithms. This criterion de- 
termines for how long a process, after performing the first write operation on a 
shared variable, indicating its interest in the CS, has to wait. All the solutions 
to the mutual exclusion problem are expected to be starvation free. However, 
ensuring starvation-freedom alone may not be sufficient in many applications;
stronger fairness criteria are desirable. This paper focuses on two such
criteria, the unfairness index and the overtake index.

Abstract. We present an N-process algorithm for mutual exclusion under
read/write atomicity that has $O(1)$ time complexity in the absence
of contention and $O(\log N)$ time complexity under contention, where
“time” is measured by counting remote memory references. This is the
first such algorithm to achieve these time complexity bounds. Our algorithm
is obtained by combining a new “fast-path” mechanism with an
arbitration-tree algorithm presented previously by Yang and Anderson.



1 Introduction 

Recent work on mutual exclusion [3] has focused on the design of “scalable” algo- 
rithms that minimize the impact of the processor-to-memory bottleneck through 
the use of local spinning. A mutual exclusion algorithm is scalable if its perfor- 
mance degrades only slightly as the number of contending processes increases. 
In local-spin mutual exclusion algorithms, good scalability is achieved by requir- 
ing all busy-waiting loops to be read-only loops in which only locally-accessible 
shared variables are accessed that do not require a traversal of the processor- 
to-memory interconnect. A shared variable is locally accessible on a distributed 
shared-memory multiprocessor if it is stored in a local memory module, and on 
a cache-coherent multiprocessor if it is stored in a local cache line.

A number of queue-based local-spin mutual exclusion algorithms have been 
proposed in which only $O(1)$ remote memory references are required for a process
to enter and exit its critical section [1,4,6]. In each of these algorithms, waiting 
processes form a “spin queue”. Read-modify-write instructions are used to en- 
queue a blocked process on this queue. Performance studies presented in [1,4,6] 
have shown that these algorithms scale well as contention increases. 

In subsequent work, Yang and Anderson showed that performance compara- 
ble to that of the queue-lock algorithms cited above could be achieved using only 
read and write operations [8]. In particular, they presented a read/write mutual 
exclusion algorithm with $O(\log N)$ time complexity and experimentally showed
that this algorithm is only slightly slower than the fastest queue locks. In Yang 
and Anderson’s algorithm, instances of a local-spin mutual exclusion algorithm 
for two processes are embedded within a binary arbitration tree, as depicted in 
Fig. 1(a). The entry and exit sections associated with the two links connecting 

* Work supported by NSF grant CCR 9732916. The first author was also supported 
by an Alfred P. Sloan Research Fellowship. 

P. Jayanti (Ed.): DISC’99, LNCS 1693, pp. 180-195, 1999. 

© Springer-Verlag Berlin Heidelberg 1999

Fig. 1. Yang and Anderson’s arbitration-tree algorithm (inset (a)) and its fast-path
variant (inset (b)).



a given node to its sons constitute a two-process mutual exclusion algorithm. 
Initially, all processes start at the leaves of the tree. To enter its critical section, 
a process traverses the path from its leaf to the root, executing the entry section 
of each link on this path. Upon exiting its critical section, a process traverses 
this path in reverse, executing the exit section of each link. 
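
The leaf-to-root traversal is what gives the arbitration tree its logarithmic cost. The following sketch assumes, purely for illustration, that N is a power of two and that tree nodes are numbered as in a binary heap (root 1, leaves N to 2N - 1); the paper does not prescribe this layout, and `path_to_root` is a hypothetical helper.

```python
def path_to_root(p, n_procs):
    """Return the (node, side) pairs visited by process p from its leaf to the root.

    `node` is the internal node whose two-process entry section is executed and
    `side` (0 or 1) says which of the node's two links the process arrives on.
    """
    node = n_procs + p                     # leaf assigned to process p
    path = []
    while node > 1:
        path.append((node // 2, node % 2))
        node //= 2
    return path

entry_order = path_to_root(5, 8)
exit_order = list(reversed(entry_order))   # the exit sections run in reverse
print(entry_order)
print(len(entry_order))                    # one two-process instance per level
```

For N = 8 every process passes through three two-process instances, that is, log2 N of them, both on entry and on exit.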

Although Yang and Anderson’s algorithm exhibits scalable performance, in
complexity-theoretic terms, there is still a gap between the $O(\log N)$ time
complexity of their algorithm and the constant time complexity of algorithms
based on stronger synchronization primitives. This gap is particularly troubling
when considering performance in the absence of contention. Even without
contention, the arbitration-tree algorithm forces each process to perform
$O(\log N)$ remote memory references in order to enter and exit its critical
section. To alleviate this problem, Yang and Anderson presented a variant of
their algorithm that includes a “fast-path” mechanism that allows the
arbitration tree to be bypassed in the absence of contention. This variant is
illustrated in Fig. 1(b). This algorithm has the desirable property that
contention-free time complexity is $O(1)$. Unfortunately, it has the undesirable
property that time complexity under contention is $O(N)$ in the worst case,
rather than $O(\log N)$. In Yang and Anderson’s fast-path algorithm, a process
checks whether the fast path can be reopened after a period of contention ends
by “polling” each process individually to see if it is still contending. This
polling loop is the reason why the time complexity of their algorithm is $O(N)$
in the worst case.

To this day, the problem of designing a read/write mutual exclusion algorithm
with $O(1)$ time complexity in the absence of contention and $O(\log N)$ time
complexity under contention has remained open. In this paper, we close this
problem by presenting a fast-path mechanism that achieves these time complexity
bounds when used in conjunction with Yang and Anderson’s arbitration-tree
algorithm. Our fast-path mechanism has the novel feature that it can be reopened
after a period of contention without having to poll each process individually to
see if it is still contending.

The rest of this paper is organized as follows. In Sec. 2, we present our fast- 
path algorithm. In Sec. 3, we prove that the algorithm is correct. We end the 
paper with concluding remarks in Sec. 4. 

2 Fast-Path Algorithm 

Our fast-path algorithm is shown in Fig. 2. In this section, we explain informally 
how the algorithm works. We begin with a brief overview of the code. We assume 
that each labeled sequence of statements in Fig. 2 is atomic; each such sequence 
reads or writes at most one shared variable. A process determines if it can 
access the fast path by executing statements 1-9. If a process p detects any 
other competing process while executing these statements, then p is “deflected” 
out of the fast path and invokes either SLOW1 or SLOW2. SLOW1 is invoked
if p has not updated any variables that must be reset in order to reopen the fast 
path. Otherwise, SLOW2 is invoked. A detailed explanation of the deflection 
mechanism is given below. If a process is not deflected, then it successfully 
acquires the fast path, which consists of statements 10-20. A process that either 
acquires the fast path or is deflected to SLOW2 attempts to reopen the fast path 
by executing statements 13-20 or 29-37, respectively. A detailed explanation of 
how the fast path is reopened is given below. 

Before entering its critical section, a fast-path process must perform the en- 
try code of the two-process mutual exclusion algorithm on top of the arbitration 
tree, as shown in Fig. 1(b). It executes this code using 0 as a virtual process 
identifier. This is denoted as “ENTRY_2(0)” in Fig. 2 (see statement 11). The cor- 
responding two-process exit code is denoted “EXIT_2(0)” (statement 19). Each 
process p that is deflected to SLOW1 or SLOW2 must first compete within
the N-process arbitration tree (using its own process identifier). The entry and
exit code for the arbitration tree are denoted “ENTRY_N(p)” and “EXIT_N(p)”,
respectively (statements 21, 25, 26, and 39). After competing within the arbi- 
tration tree, a deflected process accesses the two-process algorithm on top of the 
tree using 1 as a virtual process identifier. The entry and exit code for this are 
denoted “ENTRY_2(1)” and “EXIT_2(1)”, respectively (statements 22, 24, 27, and 
38). 

We now explain our fast-path acquisition mechanism in detail. At the heart 
of this mechanism is the following code fragment from Lamport’s fast mutual 
exclusion algorithm [5]. 

shared variable X: 0..N - 1; Y: boolean initially true
process p::
    Noncritical Section;
    X := p;
    if $\neg Y$ then “compete with other processes (slow path)”
    else Y := false;
        if $X \neq p$ then “compete with other processes (slow path)”
        else “take the fast path”

type Ytype = record free: boolean; indx: 0..N - 1 end  /* stored in one word */

shared variable
    X: 0..N - 1;
    Y, Reset: Ytype initially (true, 0);
    Slot, Proc: array[0..N - 1] of boolean initially false;
    Infast: boolean initially false

private variable y: Ytype

process p::  /* $0 \le p < N$ */
while true do
 0:  Noncritical Section;
 1:  X := p;
 2:  y := Y;
     if $\neg y.free$ then SLOW1()
     else
 3:      Y := (false, 0);
 4:      Proc[p] := true;
 5:      if ($X \neq p$ $\lor$
 6:          Infast) then SLOW2()
         else
 7:          Slot[y.indx] := true;
 8:          if $Reset \neq y$ then
 9:              Slot[y.indx] := false;
                 SLOW2()
             else
10:              Infast := true;
                 /* fast path */
11:              ENTRY_2(0);
12:              Critical Section;
13:              Proc[p] := false;
14:              Reset := (false, y.indx);
15:              if $\neg Proc[y.indx]$ then
16:                  Reset := (true, y.indx + 1 mod N);
17:                  Y := (true, y.indx + 1 mod N)
                 fi;
18:              Slot[y.indx] := false;
19:              EXIT_2(0);
20:              Infast := false
     fi fi fi
od

procedure SLOW1()
21:  ENTRY_N(p);
22:  ENTRY_2(1);
23:  Critical Section;
24:  EXIT_2(1);
25:  EXIT_N(p)

procedure SLOW2()
26:  ENTRY_N(p);
27:  ENTRY_2(1);
28:  Critical Section;
29:  Y := (false, 0);
30:  X := p;
31:  y := Reset;
32:  Proc[p] := false;
33:  Reset := (false, y.indx);
34:  if ($\neg Slot[y.indx]$ $\land$
35:      $\neg Proc[y.indx]$) then
36:      Reset := (true, y.indx + 1 mod N);
37:      Y := (true, y.indx + 1 mod N)
     fi;
38:  EXIT_2(1);
39:  EXIT_N(p)

Fig. 2. Fast-path algorithm.
This code ensures that at most one process will “take the fast path”. Moreover,
with the stated initial conditions, if one process executes this code fragment in 
isolation, then that process will take the fast path. The problem with using this 
code is that, after a period of contention ends, it is difficult to “reopen” the fast 
path so that it can be acquired by other processes. If a process does succeed in 
taking the fast path, then that process can reopen the fast path itself by simply 
resetting Y to true. On the other hand, if no process succeeds in taking the fast 
path, then the fast path ultimately must be reopened by one of the slow-path 
processes. Unfortunately, because processes are asynchronous and communicate 
only by means of atomic read and write operations, it can be difficult for a 
slow-path process to know whether the fast path has been acquired by some 
process. 
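
The behaviour described above can be illustrated with a minimal sketch. It models the code fragment sequentially (each call runs to completion with no interleaving), so it shows only the solo-run and "fast path stays closed" cases, not the races that arise under true concurrency. The class name `FastPathGate` and its methods are hypothetical names introduced for illustration.

```python
class FastPathGate:
    """Sequential model of the fast-path fragment over shared variables X and Y."""

    def __init__(self):
        self.X = None   # last process id written by "X := p"
        self.Y = True   # True means the fast path looks free

    def attempt(self, p):
        """Run the fragment for process p without interleaving; return the path taken."""
        self.X = p
        if not self.Y:
            return "slow"
        self.Y = False
        if self.X != p:
            return "slow"
        return "fast"

    def reopen(self):
        """What a fast-path winner can do on exit: reset Y to true."""
        self.Y = True


gate = FastPathGate()
first = gate.attempt(3)    # runs in isolation from the initial state
second = gate.attempt(5)   # arrives before anyone has reset Y
gate.reopen()
third = gate.attempt(5)    # succeeds once Y is true again
```

In this model the second attempt is forced onto the slow path only because Y is still false, which is exactly the situation that makes reopening the fast path the central difficulty.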

As a stepping stone towards our algorithm, consider the algorithm shown in 
Fig. 3, which uses unbounded memory to solve the problem. In this algorithm, Y 
has an additional field, which is an identifier that is used to “rename” any process 
that acquires the fast path. This identifier will increase without bound over 
time, so we will never have to worry about the possibility that two processes 
are renamed with the same identifier. With this added field, a slow-path process 
has a way of identifying a process that has taken the fast path. To see how 
this works, consider what happens when, starting from the initial state, some 
set of processes execute their entry sections. At least one of these processes will 
read Y = (true, 0) at statement 2 and assign Y := (false, 0) at statement 3.
By properties of Lamport’s fast-path code, of the processes that assign Y, at 
most one will reach statement 6. A process that reaches statement 6 will either 
acquire the fast path by reaching statement 9, or will be deflected to SLOW2 
at statement 8. 

This gives us two cases to analyze: Of the processes that read Y = (true, 0)
and assign Y, either all are deflected to SLOW2, or one, say p, acquires the fast 
path. In the former case, at least one of the processes that executes SLOW2 
will increment the indx field of Y and set the free field of Y to true (statement 
28). This has the effect of reopening the fast path. In the latter case, we must 
argue that (i) the fast-path process p reopens the fast path after leaving it, and 
(ii) no SLOW2 process “prematurely” reopens the fast path before p has left
the fast path. Establishing (i) is straightforward. Process p will reopen the fast 
path by incrementing the indx field of Y and setting the free field of Y to true 
(statement 13). Note that the Infast variable prevents the reopening of the fast 
path from actually taking effect until after p has finished executing EXIT_2(0). To 
establish (ii), suppose, to the contrary, that some SLOW2 process q reopens the 
fast path by executing statement 28 while p is executing within statements 9-15. 
For this to happen, q must have read Slot[0] at statement 26 before p assigned 
Slot[0] := true at statement 6. This in turn implies that q executed statement 25 
before p executed statement 7. Thus, p must have found Reset ≠ y at statement
7, i.e., it was deflected to SLOW2, which is a contradiction. It follows from the
explanation given here that after an initial period of contention ends, we must
have Y.free = true and Y.indx > 0. This argument can be applied inductively
to show that the fast path is properly reopened after each period of contention
ends.

type Ytype = record free: boolean; indx: 0..∞ end /* stored in one word */

shared variable /* other variable declarations are as in Fig. 2 */
    Slot: array[0..∞] of boolean initially false

process p:: /* 0 ≤ p < N */
while true do
0:    Noncritical Section;
1:    X := p;
2:    y := Y;
      if ¬y.free then SLOW1()
      else
3:        Y := (false, 0);
4:        if (X ≠ p ∨
5:            Infast) then SLOW2()
          else
6:            Slot[y.indx] := true;
7:            if Reset ≠ y then
8:                Slot[y.indx] := false;
                  SLOW2()
              else
9:                Infast := true;
                  /* fast path */
10:               ENTRY_2(0);
11:               Critical Section;
12:               Reset := (true, y.indx + 1);
13:               Y := (true, y.indx + 1);
14:               EXIT_2(0);
15:               Infast := false
      fi fi fi
od

procedure SLOW1()
16:   ENTRY_N(p);
17:   ENTRY_2(1);
18:   Critical Section;
19:   EXIT_2(1);
20:   EXIT_N(p)

procedure SLOW2()
21:   ENTRY_N(p);
22:   ENTRY_2(1);
23:   Critical Section;
24:   y := Reset;
25:   Reset := (false, y.indx);
26:   if ¬Slot[y.indx] then
27:       Reset := (true, y.indx + 1);
28:       Y := (true, y.indx + 1)
      fi;
29:   EXIT_2(1);
30:   EXIT_N(p)

Fig. 3. Fast-path algorithm with unbounded memory.
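
The two cases of the argument for the unbounded-memory algorithm can be traced with a small sketch. It models only the shared variables Y, Reset, and Slot and the reopening steps (statements 24-28 and 12-13 of Fig. 3), and it assumes these steps run without interference because they execute inside a critical section. The names `State`, `slow2_reopen`, and `fast_exit` are hypothetical helpers, not part of the algorithm.

```python
from collections import defaultdict


class State:
    """Shared variables of Fig. 3 that matter for reopening the fast path."""

    def __init__(self):
        self.Y = (True, 0)              # (free, indx)
        self.Reset = (True, 0)
        self.Slot = defaultdict(bool)   # unbounded array, initially false


def slow2_reopen(s):
    """Statements 24-28, executed by a SLOW2 process inside its critical section."""
    _, indx = s.Reset                   # 24: y := Reset
    s.Reset = (False, indx)             # 25
    if not s.Slot[indx]:                # 26: nobody claimed this identifier
        s.Reset = (True, indx + 1)      # 27
        s.Y = (True, indx + 1)          # 28: reopened under a fresh identifier
        return True
    return False


def fast_exit(s, y):
    """Statements 12-13, executed by the fast-path process on its way out."""
    s.Reset = (True, y[1] + 1)
    s.Y = (True, y[1] + 1)


# Case 1: every contender was deflected, so Slot[0] is still false.
a = State()
a.Y = (False, 0)                        # effect of statement 3
reopened_a = slow2_reopen(a)

# Case 2: p set Slot[0] at statement 6 and passed the test at statement 7.
b = State()
b.Y = (False, 0)
b.Slot[0] = True
reopened_b = slow2_reopen(b)            # q must not reopen while p holds the fast path
y_while_p_inside = b.Y
fast_exit(b, (True, 0))                 # p reopens the fast path itself
```

In both cases the run ends with Y.free = true and Y.indx = 1, matching the claim that Y.indx > 0 after the first period of contention.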

Of course, the problem with this algorithm is that the indx field of Y that 
is used for renaming will continue to grow without bound. The algorithm of 
Fig. 2 solves this problem by requiring Y.indx to be incremented modulo N.
With Y.indx being updated in this way, the following potential problem arises. 
A process p may reach statement 7 in Fig. 2 with y.indx = k and then get 
delayed. While delayed, other processes may repeatedly increment Y.indx (in 
SLOW2) until it “cycles back” to k. At this point, another process q may reach 
statement 7 with y.indx = k. This is a problem because p and q may interfere 
with each other in updating Slot[k]. The algorithm in Fig. 2 prevents such a 
scenario from happening by preventing Y.indx from cycling while some process 
executes within statements 7-18. To see how this is prevented, note that before 
reaching statement 7, a process p must first assign Proc[p] := true at statement
4. Note further that before a process can increment Y.indx from n to n + 1
mod N (statement 17 or 37), it must first check Proc[n] (statement 15 or 35) 
and find it to be false. This check prevents Y.indx from cycling while p executes 
within statements 7-18. As shown in the next section, the correctness of the code 
that reopens the fast path (statements 13-18 and 29-37) rests heavily on the fact 
that this code is executed within a critical section. 
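
A minimal sketch of the bounded reopening step, covering statements 31 and 33-37 of Fig. 2 only (statements 29, 30, and 32 are omitted), may help make the role of the Proc check concrete. It assumes the step runs without interference, as it would inside the critical section, and it reads "y.indx + 1 mod N" as (y.indx + 1) mod N. The function name and the dictionary layout are illustrative choices.

```python
def slow2_reopen_bounded(state, n):
    """Try to advance Reset.indx and Y.indx by one, modulo n."""
    _, indx = state["Reset"]                                  # 31: y := Reset
    state["Reset"] = (False, indx)                            # 33
    if not state["Slot"][indx] and not state["Proc"][indx]:   # 34-35
        nxt = (indx + 1) % n
        state["Reset"] = (True, nxt)                          # 36
        state["Y"] = (True, nxt)                              # 37
        return True
    return False


N = 3
state = {"Y": (False, 0), "Reset": (True, 2),
         "Slot": [False] * N, "Proc": [False] * N}

state["Proc"][2] = True          # process 2 has executed statement 4 and not yet cleared its flag
blocked = slow2_reopen_bounded(state, N)

state["Proc"][2] = False         # process 2 has cleared its flag
advanced = slow2_reopen_bounded(state, N)
```

While Proc[2] is true the index cannot move past 2, so it cannot wrap around to a value that process 2 might still be using; once the flag is cleared the index advances and wraps to 0.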

3 Correctness Proof 

In this section, we prove that the algorithm in Fig. 2 is correct. Specifically, 
we prove that the mutual exclusion property (at most one process executes 
critical section at any time) holds and that the fast path is always open in the 
absence of contention. (The algorithm is easily seen to be starvation-free, given 
the correctness of ENTRY and EXIT calls.) The following notational conventions 
will be used in the proof. 

Notational Conventions: Unless stated otherwise, we assume i, j, and k range
over {0..N - 1}. We use n.i to denote the statement with label n of process i, and
i.y to represent i’s private variable y. Let S be a subset of the statement labels
in process i. Then, i@S holds iff the program counter for process i equals some
value in S. □

Definition: We define a process i to be FAST-possible if the condition F(i),
defined below, is true.

    F(i) ≡ i@{3..8, 10..20} ∧
           (i@{3..5} ⇒ X = i) ∧ (i@{3..8} ⇒ Reset = i.y) □

Informally, this condition indicates that process i may potentially acquire the
fast path. It does not necessarily mean that i is guaranteed to acquire the fast
path: if F(i) holds, then process i still can be deflected to SLOW1 or SLOW2.
If a process i is at {3..9} and is not FAST-possible, then we define it to be
FAST-disabled. We will later show that a FAST-disabled process cannot acquire
the fast path. We now turn our attention to the mutual exclusion property.
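
As a reading aid, the condition F(i) can be written as an executable predicate over a snapshot of the relevant state. This is only a sketch: the parameter names (`pc` for the program counter of process i, `y` for its private copy i.y) and the helper `in_ranges` are assumptions made for illustration.

```python
def in_ranges(pc, *ranges):
    """True if pc lies in one of the inclusive (lo, hi) label ranges."""
    return any(lo <= pc <= hi for lo, hi in ranges)


def fast_possible(i, pc, X, Reset, y):
    """F(i): process i may still acquire (or already holds) the fast path."""
    at_candidate = in_ranges(pc, (3, 8), (10, 20))
    x_ok = (not in_ranges(pc, (3, 5))) or X == i          # i@{3..5} => X = i
    reset_ok = (not in_ranges(pc, (3, 8))) or Reset == y  # i@{3..8} => Reset = i.y
    return at_candidate and x_ok and reset_ok


ok_at_4 = fast_possible(1, 4, 1, (True, 0), (True, 0))
overwritten_x = fast_possible(1, 4, 2, (True, 0), (True, 0))
at_9 = fast_possible(1, 9, 1, (True, 0), (True, 0))
on_fast_path = fast_possible(1, 12, 2, (False, 0), (True, 0))
```

The last call shows that once a process is within statements 10..20 both implications hold vacuously, so F(i) no longer depends on X or Reset.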



3.1 Mutual Exclusion 

We will establish the mutual exclusion property by proving that the conjunction 
of a number of assertions is an invariant. This proves that each of these assertions 
individually is an invariant. These invariants are numbered (I1) through (I22)
and are stated on the following pages. Informally, invariants (I1) through (I4)
give conditions that must hold if a process is FAST-possible. Invariants (I5)
through (I9) prevent “cycling”. These invariants are used to show that if i@{6..9}
holds and process i is FAST-disabled, then Reset.indx must be “trapped” between
i.y.indx and i. Therefore, there is no way Reset can cycle back, erroneously
making process i FAST-possible again. Invariants (I10) through (I15) show that
certain regions of code are mutually exclusive. Invariants (I16) through (I21) are
all simple invariants that follow almost directly from the code. Invariant (I22) is
the mutual exclusion property, our goal.

In establishing these invariants, statements that might potentially establish
F(i) must be repeatedly considered. The following lemma shows that only one
such statement must be considered.

Lemma 1: If t and u are consecutive states such that F(i) is false at t but true
at u, and if each of (I1) through (I22) holds at t, then u is reached from t via
the execution of statement 2.i.

Proof: The only statements that could potentially establish F(i) are 2.i (which
establishes i@{3..8, 10..20} and may establish Reset = i.y), 5.i (which falsifies
i@{3..5}), 8.i (which falsifies i@{3..8}), 1.i (which establishes X = i), and 31.i,
14.j, 16.j, 33.j, and 36.j, where j is any arbitrary process (which may establish
Reset = i.y). We now show that none of these statements other than 2.i can
establish F(i).

Statement 5.i can establish i@{6}, and hence F(i), only if X = i holds at t.
But, by (I5), this implies that Reset = i.y holds at t as well. By the definition
of F(i), this implies that F(i) holds at t, a contradiction.

Statement 8.i can establish i@{10}, and hence F(i), only if Reset = i.y holds
at t. But this implies that F(i) holds at t, a contradiction.

Statements 1.i and 31.i establish i@{2, 32}. Thus, they cannot establish F(i).

Statements 14.j and 33.j could establish F(i) only if i@{3..8} ∧ Reset ≠ i.y
holds at t, and upon executing 14.j or 33.j, Reset = i.y is established. However,
by (I3) and (I16), 14.j and 33.j can change the value of Reset only by changing
the value of Reset.free from true to false. By (I20), if i@{3..8} holds at t, then
i.y.free = true holds as well. Thus, statements 14.j and 33.j cannot possibly
establish Reset = i.y, and hence cannot establish F(i).

Statements 16.j and 36.j likewise can establish F(i) only if i@{3..8} ∧
Reset ≠ i.y holds at t. We consider two cases, depending on whether i@{3..5}
or i@{6..8} holds at t. If i@{3..5} ∧ Reset ≠ i.y holds at t, then by (I5), X ≠ i
holds at t. This implies that X ≠ i holds at u as well, i.e., F(i) is false at u.

Now, suppose that i@{6..8} ∧ Reset ≠ i.y holds at t. By (I17), statements
16.j and 36.j increment Reset.indx by 1 modulo N. Therefore, they may establish
F(i) only if Reset.indx = (i.y.indx - 1) mod N holds at t. By (I6), this
implies that i = Reset.indx or i = i.y.indx holds at t. By (I8), the latter implies
that i = Reset.indx holds at t. Hence, in either case, i = Reset.indx holds at t.
Because we have assumed that i@{6..8} ∧ j@{16, 36} holds at t, by (I7), we have
a contradiction. Therefore, statements 16.j and 36.j cannot establish F(i). □

We now prove each invariant listed above. It is easy to see that each invariant
holds initially, so we will not bother to prove this. For each invariant I, we show
that for any pair of consecutive states t and u, if all invariants hold at t, then I
holds at u. In proving this, we do not consider statements that trivially don’t
affect I.



invariant F(i) ∧ i@{4..8, 10..17} ⇒ Y = (false, 0)    (I1)

Proof: To prove that (I1) is not falsified, it suffices to consider only those
statements that may establish the antecedent or falsify the consequent. By Lemma 1,
the only statement that can establish F(i) is 2.i. However, 2.i establishes i@{3}
and thus cannot establish the antecedent. The condition i@{4..8, 10..17} may be
established only by statement 3.i, which also establishes the consequent.

The consequent may be falsified only by statements 17.j or 37.j, where j is
any arbitrary process. If j = i, then both 17.j and 37.j establish i@{18, 38},
which implies that the antecedent is false.

Suppose that j ≠ i. By (I10) and (I11), the antecedent and j@{17} cannot
hold simultaneously (recall that j@{17} implies F(j), by definition). Hence,
statement 17.j cannot be executed while the antecedent holds. Similarly, by
(I12), (I13), and (I14), the antecedent and j@{37} cannot both hold. Hence,
statement 37.j also cannot be executed while the antecedent holds. □

invariant F(i) ∧ i@{8, 10..17} ⇒ Slot[i.y.indx] = true    (I2)

Proof: By Lemma 1, the only statement that can establish F(i) is 2.i. However,
2.i establishes i@{3} and hence cannot establish the antecedent. The condition
i@{8, 10..17} may be established only by statement 7.i, which also establishes
the consequent.

The consequent may be falsified only by statements 2.i, 31.i, 9.j, and 18.j,
where j is any arbitrary process. Statements 2.i and 31.i establish i@{3, 21, 32},
which implies that the antecedent is false. If j = i, then 9.j and 18.j establish
i@{19, 26}, which implies that the antecedent is false.

Suppose that j ≠ i. In this case, statement 9.j may falsify the consequent
only if i.y.indx = j.y.indx holds. By (I15) (with i and j exchanged), j@{9} ∧
i.y.indx = j.y.indx implies that the antecedent of (I2) is false. Thus, 9.j cannot
falsify (I2). Similarly, by (I11), if j@{18} holds (which implies that F(j) holds),
then the antecedent of (I2) is false. Thus, 18.j also cannot falsify (I2). □

invariant i@{10..16} ⇒ Reset.indx = i.y.indx    (I3)

Proof: The antecedent may be established only by statement 8.i, which does so
only if Reset = i.y holds. Therefore, statement 8.i preserves (I3).

The consequent may be falsified only by statements 2.i, 31.i, 14.j, 16.j, 33.j,
and 36.j, where j is any arbitrary process. The antecedent is false after the
execution of 2.i and 31.i and also after the execution of 16.j, 33.j, and 36.j if
j = i. If j = i, then statement 14.j preserves the consequent.

Consider 14.j, 16.j, 33.j, and 36.j, where j ≠ i. By (I11), the antecedent
of (I3) and j@{14, 16} cannot hold simultaneously (recall that i@{10..16} ⇒
F(i) and j@{14, 16} ⇒ F(j)). Similarly, by (I14), the antecedent and j@{36}
cannot hold simultaneously. Hence, statements 14.j, 16.j, and 36.j can be
executed only when the antecedent is false, and thus do not falsify (I3). By (I16),
statement 33.j cannot change Reset.indx. Hence, it does not falsify (I3). □



invariant i@{11..20} ⇒ Infast = true    (I4)

Proof: The antecedent may be established only by statement 10.i, which also
establishes the consequent. The consequent may be falsified only by statement
20.j, where j is any arbitrary process. If j = i, then statement 20.j also falsifies
the antecedent. If j ≠ i, then by (I11), the antecedent and j@{20} cannot both
hold. Hence, the antecedent is false after the execution of statement 20.j. □

invariant i@{3..5} ∧ X = i ⇒ Reset = i.y    (I5)

Proof: The antecedent may be established only by statements 1.i (which
establishes X = i) and 2.i (which may establish i@{3..5}). However, 1.i establishes
i@{2} and hence cannot establish the antecedent. Also, by (I19), statement 2.i
establishes the consequent.

The consequent may be falsified only by statements 2.i, 31.i, 14.j, 16.j, 33.j,
and 36.j, where j is any arbitrary process. However, statement 2.i preserves (I5)
as shown above. Furthermore, the antecedent is false after the execution of 31.i
and also after the execution of each of 14.j, 16.j, 33.j, and 36.j if j = i.

Consider 14.j, 16.j, 33.j, and 36.j, where j ≠ i. If the antecedent and
consequent of (I5) both hold, then F(i) holds by definition. If j ≠ i, then by (I10)
and (I12), j@{14, 16, 33, 36} cannot hold as well. Hence, these statements cannot
falsify (I5). □

invariant i@{6..9} ⇒ (i.y.indx ≤ Reset.indx ≤ i) ∨
                     (Reset.indx ≤ i ≤ i.y.indx) ∨
                     (i ≤ i.y.indx ≤ Reset.indx)    (I6)

Proof: The antecedent may be established only if 5.i is executed when X = i
holds. In this case, by (I5), Reset = i.y holds, so the consequent is preserved.

The consequent may be falsified only by statements 2.i, 31.i, 14.j, 16.j, 33.j,
and 36.j, where j is any arbitrary process. The antecedent is false after the
execution of 2.i and 31.i and also after the execution of each of 14.j, 16.j, 33.j,
and 36.j if j = i.

Consider statements 16.j and 36.j, where j ≠ i. By (I17), these statements
increment Reset.indx by 1 modulo N. Therefore, these statements may falsify
the consequent only if Reset.indx = i holds before execution. However, in this
case, by (I7), the antecedent of (I6) is false. Thus, statements 16.j and 36.j
cannot falsify (I6).

Finally, consider 14.j and 33.j, where j ≠ i. By (I3) and (I16), 14.j and 33.j
don’t change Reset.indx. Hence, they can’t falsify the consequent. □
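
The “trapping” condition in the consequent of (I6) can be checked mechanically for small N. The sketch below assumes the comparisons are non-strict (≤), which is what the step “Reset = i.y holds, so the consequent is preserved” requires; the function names are illustrative. It enumerates every triple for which a single increment of Reset.indx modulo N would falsify the consequent and confirms that this happens only when Reset.indx = i, as the proof claims.

```python
def trapped(y_indx, reset_indx, i):
    """Consequent of (I6): Reset.indx lies between i.y.indx and i in cyclic order."""
    return (y_indx <= reset_indx <= i
            or reset_indx <= i <= y_indx
            or i <= y_indx <= reset_indx)


def breaking_positions(n):
    """All (y_indx, reset_indx, i) where one increment mod n falsifies the consequent."""
    return [(y, r, i)
            for y in range(n) for r in range(n) for i in range(n)
            if trapped(y, r, i) and not trapped(y, (r + 1) % n, i)]


bad = breaking_positions(5)
only_at_i = all(r == i for (_, r, i) in bad)
```

For example, with i.y.indx = 1 and i = 3, the value Reset.indx = 3 satisfies the consequent, but incrementing it to 4 does not.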

invariant i@{6..9} ∧ Reset.indx = i ⇒ ¬(∃j :: j@{16, 36})    (I7)

Proof: The antecedent may be established only by statements 5.i, 14.k, 16.k,
33.k, and 36.k, where k is any arbitrary process. Statement 5.i establishes the
antecedent only if executed when X = i holds. In this case, by (I5), Reset = i.y,
and hence F(i), holds as well. By (I10) and (I12), this implies that ¬(∃j ::
j@{16, 36}) also holds. This implies that statement 5.i cannot falsify (I7).

If k = i, then the antecedent is false after the execution of each of 14.k, 16.k,
33.k, and 36.k. If k ≠ i, then by (I22), (∀j :: k@{16, 36} ∧ j@{16, 36} ⇒ k = j)
holds. Therefore, 16.k and 36.k both establish (∀j :: ¬j@{16, 36}), which is
equivalent to the consequent. Now, consider statements 14.k and 33.k, where
k ≠ i. By (I3) and (I16), these statements do not change Reset.indx. It follows
that, although statements 14.k and 33.k may preserve the antecedent, they do
not establish it.

The consequent may be falsified only by statements 15.j and 35.j, which
may do so only if Proc[j.y.indx] = false. However, if the antecedent of (I7) and
j@{15, 35} both hold, then the following hold: Reset.indx = j.y.indx, by (I17),
Reset.indx = i, by the antecedent, and Proc[i] = true, by (I21). Taken together,
these assertions imply that Proc[j.y.indx] = true. Therefore, statements 15.j or
35.j cannot falsify the consequent while the antecedent holds. □

invariant i@{6..9} ∧ i.y.indx = i ⇒ Reset.indx = i.y.indx    (I8)

Proof: The antecedent may be established only by statements 2.i, 5.i, and 31.i.
However, statements 2.i and 31.i establish i@{3, 32}, which implies that the
antecedent is false. Furthermore, by (I5), statement 5.i preserves the consequent.

The consequent may be falsified only by statements 2.i, 31.i, 14.j, 16.j, 33.j,
and 36.j, where j is any arbitrary process. However, the antecedent is false after
the execution of 2.i and 31.i and also after the execution of each of 16.j, 33.j,
and 36.j if j = i.

Consider statements 14.j, 16.j, 33.j, and 36.j, where j ≠ i. By (I3) and (I16),
statements 14.j and 33.j do not change Reset.indx, and hence cannot falsify
the consequent. Note also that, by (I7), the antecedent, the consequent, and
j@{16, 36} cannot all hold simultaneously. Hence, statements 16.j and 36.j
cannot falsify the consequent when the antecedent holds. □

invariant i@{9} ∧ Reset.indx = i.y.indx ⇒ Reset.free = false    (I9)

Proof: (I9) may be falsified only by statements 2.i, 8.i, 31.i, 14.j, 16.j, 33.j,
and 36.j, where j is any arbitrary process. Statements 2.i and 31.i establish
i@{3, 32}, which implies that the antecedent is false. Statement 8.i establishes
the antecedent only if executed when Reset ≠ i.y ∧ Reset.indx = i.y.indx holds,
which implies that Reset.free ≠ i.y.free. However, by (I20), i.y.free = true. Thus,
Reset.free = false.

If j = i, then each of 14.j, 16.j, 33.j, and 36.j establishes i@{15, 17, 34, 37},
which implies that the antecedent is false.

Consider statements 14.j, 16.j, 33.j, and 36.j, where j ≠ i. Statements 14.j
and 33.j trivially establish or preserve the consequent. By (I17), statements 16.j
and 36.j increment Reset.indx by 1 modulo N. Therefore, these statements may
establish the antecedent of (I9) only if executed when i@{9} ∧ Reset.indx =
(i.y.indx - 1) mod N holds. In this case, by (I6), i = Reset.indx or i = i.y.indx
holds. By (I8), the latter implies that i = Reset.indx holds. In either case,
i = Reset.indx holds. By (I7), this implies that i@{9} is false. It follows that
statements 16.j and 36.j cannot falsify (I9). □

invariant F(i) ∧ F(j) ∧ i ≠ j ⇒ ¬(i@{3..6} ∧ j@{3..8, 10..17})    (I10)

Proof: By Lemma 1, the only statement that can establish F(i) is 2.i. Therefore,
the only statements that may falsify (I10) are 2.i and 2.j. Without loss of
generality, it suffices to consider only statement 2.i.

Statement 2.i may establish F(i) ∧ i@{3..6} only if Y.free = true. We
consider two cases. First, suppose that (∃j : j ≠ i :: F(j) ∧ j@{3..5}) holds
before 2.i is executed. In this case, X = j holds by the definition of F(j).
Hence, X ≠ i, which implies that 2.i does not establish F(i). Second, suppose
that (∃j : j ≠ i :: F(j) ∧ j@{6..8, 10..17}) holds before 2.i is executed. In
this case, by (I1), Y.free = false. In either case, statement 2.i cannot establish
F(i) ∧ i@{3..6}. □

invariant F(i) ∧ F(j) ∧ i ≠ j ⇒ ¬(i@{7, 8, 10..20} ∧ j@{7, 8, 10..20})    (I11)

Proof: By Lemma 1, the only statement that can establish F(i) is 2.i. However,
2.i establishes i@{3} and hence cannot falsify (I11). The only other statements
that could potentially falsify (I11) are 6.i and 6.j. Without loss of generality, it
suffices to consider only statement 6.i.

By Lemma 1, statement 6.i may establish F(i) ∧ i@{7, 8, 10..20} only if
F(i) ∧ Infast = false holds before execution. We consider two cases. First,
suppose that (∃j : j ≠ i :: F(j) ∧ j@{7, 8, 10..17}) holds before the execution
of 6.i. In this case, by (I10), F(i) ∧ i@{6} is false. This implies that 6.i cannot
establish F(i) ∧ i@{7, 8, 10..20}. Second, suppose that (∃j : j ≠ i :: F(j) ∧
j@{18..20}) holds before 6.i is executed. In this case, by (I4), Infast = true.
Hence, statement 6.i cannot establish i@{7..20}. □

invariant F(i) ⇒ ¬(i@{3..5} ∧ j@{31..37})    (I12)

Proof: By Lemma 1, the only statement that can establish F(i) is 2.i. Therefore,
the only statements that may falsify (I12) are 2.i and 30.j.

Statement 2.i may falsify (I12) only if executed when Y.free = true ∧
j@{31..37} holds, but this is precluded by (I18). Statement 30.j may falsify (I12)
only if executed when F(i) ∧ i@{3..5} ∧ i ≠ j holds. Because statement 30.j
falsifies X = i, it also falsifies F(i) ∧ i@{3..5}. Thus, it preserves (I12). □

invariant F(i) ⇒ ¬(i@{6, 7} ∧ j@{34..37})    (I13)

Proof: By Lemma 1, the only statement that can establish F(i) is 2.i. However,
2.i establishes i@{3} and hence cannot falsify (I13). The only other statements
that may potentially falsify (I13) are 5.i and 33.j.

Statement 5.i may falsify (I13) only if executed when F(i) ∧ j@{34..37}
holds, but this is precluded by (I12). Statement 33.j may falsify (I13) only if
executed when F(i) ∧ i@{6, 7} holds, which, by (I20), implies that i.y.free =
true holds. Because statement 33.j establishes Reset.free = false, Reset ≠ i.y
holds after its execution, which implies that F(i) ∧ i@{6, 7} is false. Therefore,
statement 33.j preserves (I13). □

invariant F(i) ⇒ ¬(i@{8, 10..19} ∧ j@{35..37})    (I14)

Proof: By Lemma 1, the only statement that can establish F(i) is 2.i. However,
2.i establishes i@{3} and hence cannot falsify (I14). The only other statements
that could potentially falsify (I14) are 7.i and 34.j. Statement 7.i may
falsify (I14) only if executed when F(i) ∧ j@{35..37} holds, but this is
precluded by (I13).

Statement 34.j may falsify (I14) only if executed when F(i) ∧ i@{8, 10..19} ∧
Slot[j.y.indx] = false holds. By (I22), i@{17..19} and j@{34} cannot hold
simultaneously. Thus, 34.j could potentially falsify (I14) only if executed when
F(i) ∧ i@{8, 10..16} ∧ Slot[j.y.indx] = false holds. In this case, Slot[i.y.indx] =
true holds as well, by (I2), as does Reset.indx = i.y.indx, by the definition of F(i)
and (I3). In addition, by (I17), j@{34} implies that Reset.indx = j.y.indx holds.
Combining these assertions, we have Slot[j.y.indx] = false ∧ Slot[j.y.indx] =
true, which is a contradiction. Hence, statement 34.j cannot falsify (I14). □

invariant i@{6..9} ∧ j@{6..17} ∧ i ≠ j ⇒ i.y.indx ≠ j.y.indx    (I15)

Proof: The only statements that may falsify the consequent are 2.i, 31.i, 2.j,
and 31.j. However, the antecedent is false after the execution of each of these
statements. The only statements that can establish the antecedent are 5.i and
5.j. We show that 5.i does not falsify (I15); the reasoning for 5.j is similar.
Statement 5.i can establish the antecedent only if executed when X = i holds.
By (I5), this implies that Reset = i.y holds, which in turn implies that F(i) is
true. So, assume that X = i ∧ Reset = i.y ∧ F(i) holds before 5.i is executed.
We analyze three cases, which are defined by considering the value of process j’s
program counter.

- Case 1: j@{6..8} holds before 5.i is executed. In this case, because F(i) is
  true, by (I10), F(j) does not hold. Thus, we have j@{6..8} ∧ ¬F(j), which
  implies that Reset ≠ j.y. Because Reset = i.y, this implies that i.y ≠ j.y.
  In addition, by (I20), we have i.y.free = true ∧ j.y.free = true. Thus, the
  consequent of (I15) holds before, and hence after, 5.i is executed.

- Case 2: j@{9} holds before 5.i is executed. In this case, we show that the
  consequent of (I15) holds before, and hence after, 5.i is executed. Assume
  to the contrary that i.y.indx = j.y.indx holds before 5.i is executed. Then,
  because Reset = i.y holds, we have j.y.indx = Reset.indx. By (I9), this
  implies that Reset.free = false. Because Reset = i.y holds, this implies that
  i.y.free = false holds. However, by (I20), we have i.y.free = true, which is
  a contradiction.

- Case 3: j@{10..17} holds before 5.i is executed. In this case, by (I10),
  F(i) ∧ i@{5} is false, which is a contradiction. □



The following invariants are straightforward and are stated without proof.

invariant i@{32, 33} ⇒ Reset = i.y    (I16)

invariant i@{15, 16, 34..36} ⇒ Reset = (false, i.y.indx)    (I17)

invariant i@{30..37} ⇒ Y = (false, 0)    (I18)

invariant Y.free = true ⇒ Y = Reset    (I19)

invariant i@{3..20} ⇒ i.y.free = true    (I20)

invariant (i@{5..13, 26..32}) = (Proc[i] = true)    (I21)

invariant (Mutual exclusion) |{i :: i@{12..19, 23, 24, 28..38}}| ≤ 1    (I22)

Proof: From the specification of ENTRY_2/EXIT_2 and ENTRY_N/EXIT_N, (I22)
may fail to hold only if two processes simultaneously execute within statements
10-20. However, this is precluded by (I11). □
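
To make the label set in (I22) concrete, the invariant can be phrased as a check over a vector of program counters, one entry per process. This is only a sketch of the property being stated, not of its proof; `CS_REGION` and `mutual_exclusion_holds` are names introduced here for illustration.

```python
# Statement labels that (I22) treats as "inside the critical-section region" of Fig. 2.
CS_REGION = set(range(12, 20)) | {23, 24} | set(range(28, 39))


def mutual_exclusion_holds(pcs):
    """(I22): at most one process has its program counter in CS_REGION."""
    return sum(pc in CS_REGION for pc in pcs) <= 1


one_inside = mutual_exclusion_holds([0, 12, 21])
fast_and_slow2 = mutual_exclusion_holds([13, 30])
two_in_slow1 = mutual_exclusion_holds([23, 24])
```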



3.2 Fast Path is Always Open in the Absence of Contention 

Having shown that the mutual exclusion property holds, we now prove that 
when all processes are within their noncritical sections, the fast path is open. 
This property is formally captured by (I26) given below. Before proving (I26),
we first present three other invariants; two of these are quite straightforward and
are stated without proof.

invariant Slot[k] = true ⇒ (∃i :: i@{8..18} ∧ k = i.y.indx)    (I23)

invariant (∀i :: i@{0..2, 18..25, 38, 39}) ⇒ Y.free = true    (I24)

Proof: The only statements that can establish the antecedent are 15.i, 17.i, 34.i,
35.i, and 37.i. Both 17.i and 37.i establish the consequent.

Statements 15.i and 35.i can establish the antecedent only if Proc[k] = true,
where k = i.y.indx. By (I21), Proc[k] = true implies that k@{5..13, 26..32}
holds, which implies that the antecedent is false.

Similarly, statement 34.i can establish the antecedent only if Slot[i.y.indx] =
true. By (I23), this implies that (∃j :: j@{8..18} ∧ i.y.indx = j.y.indx) holds.
By (I22), j@{12..18} ∧ i@{34} is false. It follows that (∃j :: j@{8..11} ∧
i.y.indx = j.y.indx) holds, which implies that the antecedent is false.

The only statements that can falsify the consequent are 3.i and 29.i. Both
establish i@{4, 30}, which implies that the antecedent is false. □

invariant Infast = true ⇒ (∃i :: i@{11..20})    (I25)

invariant (Fast path is open in the absence of contention)
    (∀i :: i@{0}) ⇒ Y.free = true ∧ Infast = false ∧ Y = Reset    (I26)

Proof: If (∀i :: i@{0}) holds, then Y.free = true holds by (I24), and Infast =
false holds by (I25). By (I19), Y = Reset holds as well. □
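
Stated as a check on a single snapshot of the shared state, (I26) reads as follows. The sketch assumes Y and Reset are represented as (free, indx) pairs and that `pcs` lists the program counter of every process; the function name is illustrative.

```python
def fast_path_open(pcs, Y, Reset, infast):
    """(I26): if every process is at statement 0, the fast path must be open."""
    if any(pc != 0 for pc in pcs):
        return True                      # antecedent is false, nothing to check
    free, _ = Y
    return free and not infast and Y == Reset


quiescent_ok = fast_path_open([0, 0, 0], (True, 2), (True, 2), False)
quiescent_closed = fast_path_open([0, 0], (False, 0), (True, 1), False)
someone_active = fast_path_open([0, 5], (False, 0), (True, 0), False)
```

The second call describes a state that the invariant rules out: all processes idle while Y.free is false.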






4 Concluding Remarks 

In presenting our fast-path algorithm, we have abstracted away from the details 
of the underlying algorithms used to implement the ENTRY and EXIT calls. 
With the ENTRY_2/EXIT_2 calls in Fig. 2 implemented using Yang and Ander- 
son’s two-process algorithm, our fast-path algorithm can be simplified slightly. 
In particular, the writes to the variable Infast can be removed, and the test of 
Infast in statement 6 can be replaced by a test of a similar variable (specifically 
the variable C[0] — see [8]) used in Yang and Anderson’s algorithm. Results by 
Cypher have shown that read/write atomicity is too weak for implementing mu- 
tual exclusion with a constant number of remote memory references per critical 
section access [2]. The actual lower bound established by him is a slow growing 
function of N. We suspect that Ω(log N) is probably a tight lower bound for this
problem. At the very least, we know from Cypher’s work that time complexity
under contention must be a function of N. Thus, mechanisms for achieving
constant time complexity in the absence of contention should remain of interest 
even if algorithms with better time complexity under contention are developed. 
The problem of implementing a fast-path mechanism bears some resemblance 
to the wait-free long-lived renaming problem [7]. Indeed, thinking about con- 
nections to renaming led us to discover our fast-path algorithm. In principle, 
a fast-path mechanism could be implemented by associating a name with the 
fast path and by having each process attempt to acquire that name in its entry 
section; a process that successfully acquires the fast-path name would release it 
in its exit section. Despite this rather obvious connection, the problem of im- 
plementing a fast-path mechanism is actually a much easier problem than the 
long-lived renaming problem. In particular, while a renaming algorithm must be 
wait-free, most of the steps involved in releasing a “fast-path name” can be done 
within a process’s critical section. Our algorithm heavily exploits this fact.

Abstract. The design issues for asynchronous group mutual exclusion
have been modeled as the Congenial Talking Philosophers, and solutions 
for shared-memory models have been proposed [4]. This paper presents 
an efficient and highly concurrent distributed algorithm for computer 
networks where processes communicate by message passing. 



1 Introduction 

The design issues for mutual exclusion between groups of processes have been 
modeled by Joung [4] as the Congenial Talking Philosophers. The problem concerns
a set of N philosophers p1, p2, ..., pN, which spend their time thinking
alone and talking in a forum. Initially, all philosophers are thinking. From time
to time, when a philosopher is tired of thinking, it wishes to attend a forum of 
its choice. Given that there is only one meeting room — the shared resource, a 
philosopher attempting to enter the meeting room to attend a forum can succeed 
only if the meeting room is empty (and in this case the philosopher starts the 
forum), or some philosopher interested in the forum is already in the meeting 
room (and in this case the philosopher joins this ongoing forum). We assume 
that when a philosopher has attended a forum, it spends an unpredictable but 
finite amount of time in the forum. After a philosopher leaves a forum (that is, 
exits the meeting room), it returns to thinking.^ The problem is to design an 
algorithm for the philosophers satisfying the following requirements: 

Mutual Exclusion: if some philosopher is in a forum, then no other philoso- 
pher can be in a different forum simultaneously. 

Bounded Delay: a philosopher attempting to attend a forum will eventually 
succeed. 

This research was supported in part by the National Science Council, Taipei, Taiwan, 
under Grants NSC 86-2213-E-002-053 and NSC 87-2218-E-002-050, and by the 1998 
Research Award of College of Management, National Taiwan University. 

^ Throughout the paper, “in a forum” is used synonymously with “in the meeting
room.” So, “to attend/leave a forum” is used synonymously with “to enter/exit the
meeting room.”
Concurrent Entering: if some philosophers are interested in a forum and no
philosopher is interested in a different forum, then the philosophers can at- 
tend the forum concurrently. 



The requirement of “concurrent entering” prevents unnecessary synchronization
among philosophers attending the same forum when no other philosopher is
interested in a different forum. For example, solutions that simply adopt a
conventional n-process mutual exclusion algorithm for the problem are obviously
overkill. (For a survey of such algorithms see the books [8,1,12,7].) Moreover,
solutions that allow philosophers to be in a forum simultaneously, but still
require them to compete with one another in some mutually exclusive style in
order to enter the meeting room, are also not appropriate for this problem.

As usual, we are interested in completely decentralized solutions for the prob- 
lem. A “semi-centralized” solution can be easily derived, for example, by employ- 
ing a “concierge” for each forum. A philosopher interested in a particular forum 
first issues a request to the concierge of the forum. The concierges then compete 
with one another in a mutually exclusive style to obtain a privilege for their 
philosophers to use the meeting room. The algorithm is “semi-centralized” be-
cause, although the contention for the meeting room is resolved in a decentralized
fashion by the concierges, the decision as to when a set of philosophers interested
in the same forum can enter the meeting room is made centrally by the concierge
of the forum. So, like other centralized solutions, the algorithm is vulnerable to
any fault and performance bottleneck of the concierges. Furthermore, the
“semi-centralized” solution must process each request for a forum in two stages: one
between the requesting philosopher and the corresponding concierge, and the 
other among the concierges for mutual exclusion. 

In this paper we focus on the Congenial Talking Philosophers problem in com-
puter networks where philosophers communicate by reliable, FIFO message
passing. Solutions for shared-memory models are treated in [4]. While it is true
that shared-memory algorithms can be systematically converted to message pass-
ing (or the other way around; see, e.g., Ch. 17 of [7]), such a transformation is gen-
erally costly. For example, the transformation of the shared-memory algorithm
presented in [4] may result in an asymmetric solution where some processes are 
designated to maintain the shared variables, usually in a centralized fashion, 
and so the processes are often the bottleneck of the performance; the solution 
also requires many messages (more than — 1)) per entry to the critical sec- 
tion. Thus, it is worth investigating solutions that directly take advantage of 
the underlying features of the execution model. For example, a communication 
imposes a causal ordering between the initiator (the information provider) and 
its target (the information recipient), and the send and receive commands in
the message-passing paradigm implicitly assume this causal ordering in the ex-
ecution. In contrast, a more sophisticated technique is required in a completely
decentralized shared-memory model to ensure that two asynchronous processes
engaged in a communication are appropriately synchronized so that the infor-
mation provider will not overwrite the information before the other process has
observed the content.

Indeed, as we shall see shortly in Section 3, a symmetric and completely
decentralized solution satisfying the three basic requirements (mutual exclusion,
bounded delay, and concurrent entering) can be easily devised by modifying
Ricart and Agrawala’s algorithm [10] for n-process mutual exclusion. This was
not our experience in the shared-memory model; the algorithm
presented in [4] is somewhat complex and is not a straightforward adaptation of
existing algorithms for mutual exclusion. Nevertheless, it is easy to be deceived
by this simple algorithm: its behavior appeared to be fine under static analysis
until we simulated it and learned that it is only slightly better than one
imposing mutual exclusion on every entry to the critical section! Therefore, it is
also interesting to see why such a simple modification does not work well and how a
more efficient algorithm can be devised for the problem.

The rest of the paper is organized as follows: Section 2 presents some crite- 
ria for evaluating solutions for the Congenial Talking Philosophers problem. In 
Section 3 we present the straightforward solution described above and show why
the solution has a surprisingly poor performance. Section 4 then presents a more 
concurrent solution. Conclusions and future work are offered in Section 5. 



2 Complexity Measures 

Solutions for the Congenial Talking Philosophers problem can be evaluated from
the following four perspectives: messages, time, context switches, and degree of
concurrency. As in the conventional mutual exclusion problem, message complex-
ity is concerned with the number of messages the system generates per entry to
the critical section, that is, the meeting room. The other three complexity measures
are defined below.

Definition 1. A passage by $p_i$ through the meeting room is an interval $[t_1, t_2]$,
where $t_1$ is the time $p_i$ enters the meeting room, and $t_2$ the time it exits the
meeting room. The passage is initiated at $t_1$, and is completed at $t_2$. The passage
is ongoing at any time in between $t_1$ and $t_2$. The attribute of the passage is $(p_i, X)$,
where X is the forum $p_i$ is attending.

When no confusion is possible, we use intervals and attributes interchange-
ably to represent passages. The phrase “a passage through X by $p_i$” refers to a
passage with attribute $(p_i, X)$.

Definition 2. Let S be a set of intervals. A subset R of S is a minimal cover
of S if for every $\alpha \in S$, every time instance in $\alpha$ is in some $\beta \in R$ (that is,
$\forall [t_1, t_2] \in S,\ t_1 \le t \le t_2 \Rightarrow \exists [t_3, t_4] \in R,\ t_3 \le t \le t_4$) and the size of R is
minimal. The dimension of S, denoted by $C(S)$, is the size of a minimal cover
of S.
Fig. 1. A layout of passages.



For example, suppose that a philosopher $p_i$ has to wait for the following
passages before it can enter the meeting room: $(p_1,X)$, $(p_2,X)$, $(p_3,X)$, $(p_4,Y)$,
$(p_5,Y)$, $(p_6,Y)$, $(p_7,Z)$, and $(p_8,Z)$ (see Fig. 1). Then the five passages $(p_1,X)$,
$(p_3,X)$, $(p_4,Y)$, $(p_7,Z)$, and $(p_8,Z)$ constitute a minimal cover of the set of all
eight passages drawn in the figure. So, the dimension of the set of the eight pas-
sages is 5. Because passages may overlap, when measuring the “time” a philoso-
pher may wait in terms of passages, we cannot directly count all the passages it
has to wait for, but rather their dimension. So in this case the five passages
$(p_1,X)$, $(p_3,X)$, $(p_4,Y)$, $(p_7,Z)$, and $(p_8,Z)$ suffice to account for $p_i$’s wait.
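
The dimension of a set of passages can be computed with the classical greedy interval-cover procedure. The sketch below is only an illustration: the function names and the integer timings in `layout` are hypothetical (the actual timings of Fig. 1 are not given in the text), chosen so that the resulting cover matches the five passages named above.

```python
from typing import List, Optional, Tuple

# (time entered, time exited, label); intervals are closed.
Passage = Tuple[int, int, str]


def minimal_cover(passages: List[Passage]) -> List[Passage]:
    """Greedy smallest subset whose union equals the union of all passages."""
    # Sort by start time; among equal starts, the longest passage comes first.
    items = sorted(passages, key=lambda p: (p[0], -p[1]))
    cover: List[Passage] = []
    covered_to: Optional[int] = None  # right end of the region covered so far
    i = 0
    while i < len(items):
        if covered_to is None or items[i][0] > covered_to:
            # A gap: a new stretch starts here, so the longest passage
            # starting at this instant has to be taken.
            best = items[i]
            i += 1
        else:
            # Among passages starting inside the covered region, keep the one
            # that reaches furthest to the right.
            best = items[i]
            while i < len(items) and items[i][0] <= covered_to:
                if items[i][1] > best[1]:
                    best = items[i]
                i += 1
            if best[1] <= covered_to:
                continue  # none of them extends the covered region
        cover.append(best)
        covered_to = best[1]
    return cover


def dimension(passages: List[Passage]) -> int:
    """Size of a minimal cover (Definition 2)."""
    return len(minimal_cover(passages))


# Hypothetical timings for the eight passages of the example.
layout = [
    (0, 3, "p1,X"), (1, 2, "p2,X"), (2, 5, "p3,X"),
    (6, 9, "p4,Y"), (7, 8, "p5,Y"), (7, 9, "p6,Y"),
    (10, 12, "p7,Z"), (11, 14, "p8,Z"),
]
print([p[2] for p in minimal_cover(layout)], dimension(layout))
```

With these assumed timings the passages of $p_2$, $p_5$ and $p_6$ lie entirely inside time already covered by other passages, so they do not add to the wait.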

Likewise, the time complexity of a solution for the Congenial Talking Philoso-
phers problem is measured by $C(T)$, where T is the maximal set of passages that
may be initiated after a philosopher $p_i$ has made a request for a forum, and that
must be completed before $p_i$ can enter the meeting room. Note that the defini-
tion of T does not include those passages that are initiated after $p_i$ has made its
request but that $p_i$ need not wait for in order to enter the
meeting room; that is, we do not count those passages that may be concurrently
ongoing with $p_i$’s.

In practice, a “context switch” occurs when a shared resource is “cleaned” for 
a new group of processes. Depending on the applications, some context switches 
may be very time-consuming. So in the Congenial Talking Philosophers problem 
a philosopher waiting for more passages through the same forum may in practice 
need less time than one waiting for fewer passages through different fora. So, in 
addition to time complexity, we also need a measurement of “context switches” , 
which requires the following definition. 

Definition 3. Let $U_X$ be a set of passages through forum X. Let
$t_s = \min\{t \mid [t, t'] \in U_X\}$ and $t_f = \max\{t' \mid [t, t'] \in U_X\}$. Then, $U_X$ is a round of
passages through X (or simply a round of X) if:

1. No passage other than those of $U_X$ is initiated in between $t_s$ and $t_f$; and

2. The last passage initiated before $t_s$ and the first passage initiated after $t_f$,
if any, must be for a different forum.

In other words, a round of X is a maximal set of consecutive passages through
forum X. If $U_X$ is a round of X, then we say that the round is initiated at $t_s$,
and completed at $t_f$.
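
As an illustration of Definition 3, the following sketch splits a recorded history of passages into rounds by grouping consecutive passages (in order of initiation) that are for the same forum. It assumes the history comes from an execution that already satisfies mutual exclusion; the function name and the sample timings are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple

# (time entered, time exited, forum)
Passage = Tuple[int, int, str]


def rounds(passages: List[Passage]) -> List[Tuple[str, int, int, int]]:
    """Return one tuple (forum, t_s, t_f, number of passages) per round."""
    ordered = sorted(passages, key=lambda p: p[0])  # order of initiation
    result = []
    # groupby collects maximal runs of consecutive passages with the same forum.
    for forum, group in groupby(ordered, key=lambda p: p[2]):
        members = list(group)
        t_s = min(p[0] for p in members)  # the round is initiated at t_s
        t_f = max(p[1] for p in members)  # and completed at t_f
        result.append((forum, t_s, t_f, len(members)))
    return result


# Hypothetical history: three passages through X, three through Y, two through Z.
history = [
    (0, 3, "X"), (1, 2, "X"), (2, 5, "X"),
    (6, 9, "Y"), (7, 8, "Y"), (7, 9, "Y"),
    (10, 12, "Z"), (11, 14, "Z"),
]
for r in rounds(history):
    print(r)
```

Counting the tuples returned by `rounds` gives the number of context switches the meeting room goes through in the recorded history.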
The context-switch complexity is measured by the maximum number of
rounds of passages that may be initiated after a philosopher $p_i$ has made a re-
quest for X, but before a round of X is initiated in which $p_i$ can make a passage
through X. For example, in Fig. 1, suppose that no passage is initiated before
$(p_1,X)$. Then, $p_i$ waits for 3 rounds of passages for its request: a round of X, a
round of Y, and a round of Z.

We measure the degree of concurrency by the maximum number of entries to
the meeting room that can still be made while some philosopher is in the meeting
room and another philosopher is waiting for a different forum. (Obviously, if no
philosopher is waiting for a different forum while a philosopher is in the meeting
room, then, in the presence of concurrent entering, every philosopher interested
in the same forum can enter/re-enter the meeting room without any restriction.)
We choose this measurement because when a philosopher p is in the meeting
room, no other philosopher interested in a different forum can use the meeting
room. Given that p decides on its own when it will leave the meeting room, a
better resource utilization can be achieved if we allow more philosophers
interested in the same forum to share the meeting room with p.

Note that the above complexity measures all concern worst-case scenarios.
Undoubtedly, an average-case analysis would be more appealing for the
evaluation. However, due to the dynamic nature of the problem, an average-
case analysis is extremely complicated. Simulation studies [6] are therefore en-
couraged to provide some insight into the average-case behavior of a proposed
solution.

3 A Straightforward Decentralized Solution 

Recall that in Ricart and Agrawala’s algorithm [10] for n-process mutual exclu-
sion, a process requiring entry to the critical section multicasts a request message
to every other process, and enters the critical section only when all the other
processes have replied to this request. To ensure mutual exclusion and bounded
delay, each process maintains a sequence number SN that is to be updated ac-
cording to Lamport’s causality rules [5]. Each process’s SN is initialized to 0.
On issuing a request, a process $p_i$ increases its SN by 1, and attaches the pair
$(i, sn_i)$ to the request, where i is the unique identity of the process and $sn_i$ is the
new value of SN. Formally, $(i, sn_i)$ is the priority of the request. Upon receiv-
ing a request with a priority $(i, sn_i)$, a process $p_j$ adjusts the value of its SN to
$\max(SN, sn_i)$, and uses the following rules to decide when to reply to the request:

1. $p_j$ replies immediately if it does not require entry to the critical section, or
it requires entry to the critical section and the priority of its request is lower
than $(i, sn_i)$.

2. The reply is delayed if $p_j$ also requires entry to the critical section and the
priority of its request is higher than $(i, sn_i)$. The reply is delayed until $p_j$
has exited the critical section.

The priority is ordered as follows: a priority $(i, sn_i)$ is higher than $(j, sn_j)$, de-
noted by $(i, sn_i) > (j, sn_j)$, if and only if $sn_i < sn_j$, or $sn_i = sn_j$ and $i < j$.
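
The sequence-number bookkeeping and the priority order just described can be sketched as follows. This is a minimal illustration, not Ricart and Agrawala's full protocol: message transport and the deferred-reply queue are omitted, and the names `Process`, `observe`, `higher` and `reply_immediately` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Priority = Tuple[int, int]  # (process identity i, sequence number sn_i)


@dataclass
class Process:
    pid: int
    sn: int = 0  # sequence number SN, initialized to 0

    def issue_request(self) -> Priority:
        """Increase SN by 1 and return the priority attached to the request."""
        self.sn += 1
        return (self.pid, self.sn)

    def observe(self, priority: Priority) -> None:
        """On receiving a request, adjust SN to max(SN, sn_i)."""
        _, sn_i = priority
        self.sn = max(self.sn, sn_i)


def higher(a: Priority, b: Priority) -> bool:
    """True if a = (i, sn_i) is higher than b = (j, sn_j)."""
    (i, sn_i), (j, sn_j) = a, b
    return sn_i < sn_j or (sn_i == sn_j and i < j)


def reply_immediately(own: Optional[Priority], incoming: Priority) -> bool:
    """Rule 1 versus rule 2; `own` is None if the receiver is not requesting."""
    return own is None or higher(incoming, own)


p1, p2 = Process(1), Process(2)
r1 = p1.issue_request()   # (1, 1)
p2.observe(r1)            # p2 learns of r1 before making its own request
r2 = p2.issue_request()   # (2, 2): necessarily lower than r1
print(r1, r2, higher(r1, r2))
```

A request issued after another request has been received always gets a larger sequence number, which is why causally later requests have lower priority.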
Ricart and Agrawala’s algorithm can be straightforwardly modified for the
Congenial Talking Philosophers problem as follows: A philosopher wishing to
attend forum X multicasts a request to every other philosopher, and enters the
meeting room when all philosophers have replied to its request. A request mes-
sage is of the form $Req((i, sn), X)$, which additionally bears the name of the
forum the philosopher wishes to attend. Upon receipt of the request, a philoso-
pher $p_j$ uses the following rules to decide when to reply to the request:

1. $p_j$ replies immediately if it is not interested in a different forum, or it is
interested in a different forum and the priority of its request is lower than
that of $p_i$’s request. (A philosopher is interested in a forum if it has issued a
request for the forum, and either it is still waiting to attend the forum,
or it is already in the forum.)

2. The reply is delayed if $p_j$ is interested in a different forum and the priority
of its request is higher than that of $p_i$’s request.
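
The two reply rules can be written as a single predicate. The sketch below is an assumption-laden illustration: a request is modelled as a pair (priority, forum), a philosopher that is not interested in any forum is modelled as `None`, and the names `rank` and `ra1_reply_now` are hypothetical.

```python
from typing import Optional, Tuple

Priority = Tuple[int, int]      # (philosopher identity, sequence number)
Request = Tuple[Priority, str]  # (priority, forum name)


def rank(p: Priority) -> Tuple[int, int]:
    """Smaller rank means higher priority: compare sequence number, then id."""
    return (p[1], p[0])


def ra1_reply_now(own: Optional[Request], incoming: Request) -> bool:
    """True if the receiver replies immediately, False if the reply is delayed."""
    if own is None:
        return True                  # not interested in any forum
    own_priority, own_forum = own
    in_priority, in_forum = incoming
    if own_forum == in_forum:
        return True                  # same forum: never delay
    # Different forum: reply only if the receiver's own priority is lower.
    return rank(in_priority) < rank(own_priority)


print(ra1_reply_now(((2, 1), "Y"), ((1, 5), "X")))  # receiver has priority: delayed
print(ra1_reply_now(((2, 7), "Y"), ((1, 5), "X")))  # receiver is lower: replies
```

Note that the forum comparison comes first, so requests for the same forum are never delayed regardless of priority.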

We refer to the modified algorithm as RA1. It is easy to see that RA1 sat-
isfies the three requirements: mutual exclusion, bounded delay, and concurrent
entering. For mutual exclusion, observe that by the way sequence numbers are
maintained, if a philosopher $p_j$ requests a forum after it has received a request
$Req((i, sn), X)$ issued by $p_i$, then $p_j$ must obtain a priority for its request lower
than $(i, sn)$. So, when $p_j$’s request arrives at $p_i$, if the request is for a different
forum, then the reply to the request must be deferred by $p_i$ until $p_i$ has left X
(for which the request $Req((i, sn), X)$ is issued). Therefore, $p_j$ cannot attend
a different forum while $p_i$ is in X. Moreover, given that priorities are unique,
when two philosophers request different fora concurrently, the request issued by 
the low-priority philosopher will be delayed by the high-priority one until the 
high-priority philosopher has passed through a forum. So, the two philosophers 
cannot be in different fora simultaneously. Because neither logically dependent 
requests nor concurrent requests can violate mutual exclusion, mutual exclusion 
is therefore guaranteed. 

For bounded delay, observe that after $p_i$ has issued a request $Req((i, sn), X)$,
for each $p_j$, $p_i$ can receive at most one request from $p_j$ with a priority higher
than that of $p_i$’s request. This is because messages over a communication link
are delivered in the order sent. So, if $p_i$ receives a request from $p_j$ after $p_i$
has issued $Req((i, sn), X)$, then before $p_j$ makes a new request, it must have re-
ceived $p_i$’s reply to its previous request, and so it must have received $p_i$’s request
$Req((i, sn), X)$ sent before the reply. So, $p_j$’s new request must have a priority
lower than that of $p_i$’s request. So, in between the time (call this $t_1$) $p_i$ issues a
request and the time (call this $t_2$) it receives replies from all other philosophers,
$p_j$ can have at most two requests with a priority higher than that of $p_i$’s request:
one that $p_i$ receives before $t_1$, and the other that $p_i$ receives after $t_1$. (We assume
that a request ceases to exist once the requesting philosopher has completed a
passage through a forum for the request.) Overall, in between $t_1$ and $t_2$, there
are at most $2(N - 1)$ requests having a priority higher than that of $p_i$’s request.
Because the reply to a request can be deferred only if the receiving philosopher is
interested in a different forum and its request for the forum has a priority higher
than that of the former request, and because a philosopher spends only finite
time in a forum, $p_i$’s request will eventually be replied to by all other philosophers.
Hence, bounded delay is guaranteed.

For concurrent entering, observe that if a set of philosophers attempt to 
attend a given forum X, then they will not delay one another’s request. So if 
no philosopher is interested in a different forum, then each of the philosophers 
interested in X will be able to obtain a reply from every other philosopher. Since 
no extra synchronization is imposed on how a philosopher replies to the requests 
for X, the requesting philosophers attend X in a purely concurrent style. 

To evaluate the algorithm, consider first message complexity. It is easy to
see that, like Ricart and Agrawala’s algorithm, RA1’s message complexity is
$2(N - 1)$: each entry to the meeting room requires $N - 1$ requests and $N - 1$
replies.

For time complexity, let T be the set of passages that are initiated after $p_i$
has sent out a request, and that must be completed before $p_i$ can enter the
meeting room. Recall from the previous discussion that in between the time $p_i$
issues a request and the time it receives replies from all other philosophers, there
are at most $2(N - 1)$ requests having a priority higher than that of $p_i$’s request.
Clearly, only these requests can result in passages in T. So, $|T| \le 2(N - 1)$. It
is easy to see that the dimension of T is also $2(N - 1)$. This is because, in the
worst case, the passages in T may be mutually disjoint. So the time complexity
of RA1 is $2(N - 1)$. Moreover, from the above discussion it is easy to see that
RA1’s context-switch complexity is also $2(N - 1)$.

Finally, from the above discussion it is also easy to see that if $p_i$ has issued
a request for X, then while a philosopher is occupying the meeting room for
forum Y, at most $2(N - 2)$ entries to the meeting room can be made before $p_i$
enters the meeting room. Therefore, the maximum degree of concurrency offered
by the algorithm is $2(N - 2)$.

The simulation results are summarized in Fig. 4 in the appendix. The results 
indicate that the behavior of the system is only slightly better than the case 
where the philosophers use the meeting room in a mutually exclusive style! 

We now analyze the simulation results from a more static point of view.
Assume there are m fora $X_1, \ldots, X_m$. Let $N = m \cdot k$, and assume that all N
philosophers wish to attend a forum, where for each of the m fora k philosophers
have requested the forum. Suppose that the philosophers’ priorities are ordered
decreasingly as follows:

$$p_{1,1}, p_{2,1}, \ldots, p_{m,1}, p_{1,2}, p_{2,2}, \ldots, p_{m,2}, \ldots, p_{1,k}, p_{2,k}, \ldots, p_{m,k}$$

where $p_{i,j}$ denotes the j-th philosopher that is interested in $X_i$. By the algorithm,
the $m \cdot k$ requests will yield $m \cdot k$ rounds of passages (because the philosophers
will enter the meeting room in the order of their priorities), where each round
consists of only one passage.

Consider the expected concurrency behavior of the above example. Observe
that the total number of rounds of passages these $m \cdot k$ requests may generate
depends on how their priorities are ordered. Suppose that of the $(m \cdot k)!$ possible
orderings, the $m \cdot k$ requests have an equal probability to assume each ordering.
Let $E(m, k)$ be the expected number of rounds of passages the requests may
generate. In combinatorics, $E(m, k)$ is equivalent to the expected number of
runs that $m \cdot k$ balls, k balls per color, generate when they are randomly placed
on a line, where a run is defined to be a maximal sequence of balls of the same
color. By a combinatorial analysis [2], we have

$$E(m, k) = mk - k + 1$$

Given that there are $m \cdot k$ requests, the expected number of passages per round
is

$$\frac{mk}{mk - k + 1} < \frac{mk}{mk - k} = \frac{m}{m - 1}$$

(One may compare this value with the measured average round size in our sim-
ulation for RA1.) Therefore, when m is large, the expected number of passages
per round is nearly 1, meaning that the concurrency of RA1 is almost the same
as one applying a conventional mutual exclusion algorithm such as Ricart and
Agrawala’s to strictly impose mutual exclusion on every entry to the meeting
room!
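
The closed form $E(m, k) = mk - k + 1$ can be checked by brute force for small instances. The sketch below enumerates all $(m \cdot k)!$ orderings of the balls (treating balls of the same color as distinguishable, so every ordering is equally likely) and averages the number of runs exactly; the function names are hypothetical, and the enumeration is only feasible for very small m and k.

```python
from fractions import Fraction
from itertools import groupby, permutations


def expected_runs(m: int, k: int) -> Fraction:
    """Exact expected number of runs for m colors with k balls each."""
    balls = [color for color in range(m) for _ in range(k)]
    total_runs = 0
    orderings = 0
    # permutations() treats equal colors at different positions as distinct,
    # so it yields all (m*k)! equally likely orderings.
    for order in permutations(balls):
        total_runs += sum(1 for _ in groupby(order))  # number of runs
        orderings += 1
    return Fraction(total_runs, orderings)


def passages_per_round(m: int, k: int) -> Fraction:
    """Expected round size mk / E(m, k)."""
    return Fraction(m * k) / expected_runs(m, k)


for m, k in [(2, 2), (2, 3), (3, 2)]:
    print(m, k, expected_runs(m, k), m * k - k + 1, passages_per_round(m, k))
```

For instance, with three fora and two philosophers per forum the expected number of rounds is 5 for 6 requests, so on average a round holds only 6/5 passages.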

4 A Highly Concurrent Solution 

The poor average-case concurrency behavior of RA1 is due to the fact that if
two philosophers $p_i$ and $p_j$ are interested in the same forum, but a third philoso-
pher $p_k$ interested in a different forum has obtained a priority in between $p_i$’s
and $p_j$’s, then $p_i$ and $p_j$ cannot attend a forum concurrently because the low-
priority philosopher (say $p_j$) must wait for $p_k$ to finish a forum before it can
attend a forum. Intuitively, $p_j$ should be allowed to join $p_i$’s forum, for other-
wise concurrency would not be increased. To do so, while $p_i$ is in a forum, if
it receives $p_j$’s request for the same forum, then we let $p_i$ reply to $p_j$ with a
start message to inform $p_j$ that it can directly enter the meeting room to join
the ongoing forum. That is, we allow a high-priority philosopher to capture a
low-priority one to join a forum. (A captured philosopher cannot in turn capture
other philosophers; this avoids a possibility of livelock, e.g., two philosophers repeat-
edly capturing each other to attend a forum while blocking a third philosopher
waiting for a different forum.) Obviously, $p_j$’s entry to the meeting room must
be known by $p_k$ because $p_k$ may have already received $p_j$’s reply and is waiting
for only $p_i$’s reply. So, when $p_i$ leaves the forum, in the reply to $p_k$’s request, $p_i$
must inform $p_k$ that it has captured $p_j$ to attend a forum. $p_k$ then uses this in-
formation to decide if it has an “up-to-date” reply from $p_j$. If so, then $p_k$ is sure
that $p_j$ has also exited the forum and so can enter the meeting room without
violating mutual exclusion. On the other hand, if the most recent reply from $p_j$
is “out-of-date”, then $p_k$ must wait until $p_j$ sends it a new reply. Clearly, $p_j$ must
send such a reply on exiting the meeting room.
Note that because $p_j$ may have already replied to $p_k$’s request before it is
captured by $p_i$, it may reply to $p_k$’s request twice (where the other is sent upon
exiting the meeting room). The extra reply is offset by the dispensation of $p_k$’s
reply to $p_j$’s request because $p_j$’s request has already been granted by $p_i$. So
the overall messages required per entry to the meeting room are not increased
by the extra reply. However, $p_j$’s request may arrive at $p_i$ before $p_i$ enters the
meeting room. Because philosophers interested in the same forum will not block
each other’s request (so as to ensure concurrent entering), upon receipt of $p_j$’s
request $p_i$ must reply to the request immediately. When $p_i$ is later allowed to
use the meeting room, it has to send a start message to capture $p_j$. So, $p_i$ may
respond to $p_j$’s request twice, increasing the number of messages required per
entry to the meeting room by $N - 1$ in the worst case (where all
philosophers have requested the same forum).

Moreover, from the above discussion we see that sequence numbers alone
do not convey enough information for philosophers to decide whether a reply
to a request is out-of-date. So, in the new algorithm, which we shall refer to as
RA2, we let each philosopher $p_i$ maintain a vector sequence number $VSN_i$ [9,11].
$VSN_i$ is a vector of natural numbers of length N and is initialized to contain
all zeros. The value $VSN_i[j]$ represents a count of requests that have occurred
at $p_j$ and that are known at $p_i$, either because they originate there (when $j = i$)
or because their existence is known about through message passing. Let $\mathcal{VSN}$
denote the set of vectors of natural numbers of length N and $\mathcal{N}$ denote the set
of natural numbers. The binary relation ‘<’ on $\mathcal{N} \times \mathcal{VSN}$ is defined as follows:

$$(k, v) < (j, u) \iff \sum_l v[l] > \sum_l u[l] \ \lor\ \Bigl(\sum_l v[l] = \sum_l u[l] \ \land\ k > j\Bigr)$$

It can be seen that the binary relation ‘<’ on $\mathcal{N} \times \mathcal{VSN}$ is antisymmetric and
transitive; and for any two distinct pairs $(j, u)$ and $(k, v)$, either $(j, u) < (k, v)$
or $(k, v) < (j, u)$.
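
The relation ‘<’ translates directly into code. In this sketch a priority is a pair of a philosopher identity and a vector sequence number; the function names `lower` and `higher` are hypothetical.

```python
from typing import Sequence, Tuple

Priority = Tuple[int, Sequence[int]]  # (philosopher id, vector sequence number)


def lower(a: Priority, b: Priority) -> bool:
    """a < b: a has the larger vector sum, or equal sums and the larger id."""
    (k, v), (j, u) = a, b
    return sum(v) > sum(u) or (sum(v) == sum(u) and k > j)


def higher(a: Priority, b: Priority) -> bool:
    """a > b, defined as b < a."""
    return lower(b, a)


print(lower((2, [1, 1, 0]), (1, [1, 0, 0])))  # larger sum, hence lower priority
print(lower((2, [1, 0]), (1, [0, 1])))        # equal sums, tie broken by identity
```

Because ties on the vector sum are broken by identity, two requests from different philosophers are always comparable.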

Furthermore, pi also maintains a vector VF i of natural numbers of length N, 
where VFi[j], initialized to 0, represents a count of entries to the meeting room 
that are made by pj and that are known at pi . 

When a philosopher $p_i$ wishes to attend a forum X, it increments $VSN_i[i]$
by 1 and, like RA1, $p_i$ multicasts a request message $Req((i, vsn_i), X)$ to every
other philosopher, where $vsn_i$ is the new value of $VSN_i$. The value $(i, vsn_i)$ is the
priority of the request. A priority $(j, u)$ is higher than $(k, v)$ if and only if
$(j, u) > (k, v)$. Unlike RA1, however, $p_i$ may enter the meeting room either because every
philosopher has replied to its request with an “up-to-date” acknowledgment Ack,
or because some philosopher has replied to its request with a message Start. In
the former case, we say that $p_i$ enters the meeting room as a captor (and acts
as a captor in the meeting room), while in the latter case $p_i$ enters the meeting
room as a captive.

Upon receiving a message $Req((i, vsn_i), X)$, $p_j$ updates $VSN_j$ to $merge(VSN_j, vsn_i)$,
where function $merge(u, v)$ is defined as follows:

$$merge(u, v)[k] = \max(u[k], v[k]), \quad 1 \le k \le N$$

It can be seen that if a request with priority $(j, u)$ (logically) happens before a
request with priority $(k, v)$, then $(j, u) > (k, v)$.
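
A small simulation illustrates why a request that happens before another gets the higher priority: merging and then incrementing always yields a strictly larger vector sum. The sketch uses 0-based philosopher identities and hypothetical helper names (`issue`, `receive`); it models only the vector bookkeeping, not the message exchange.

```python
from typing import List, Tuple


def merge(u: List[int], v: List[int]) -> List[int]:
    """Component-wise maximum of two vectors of equal length."""
    return [max(a, b) for a, b in zip(u, v)]


N = 3
vsn = [[0] * N for _ in range(N)]  # vsn[i] plays the role of VSN_i


def issue(i: int) -> Tuple[int, List[int]]:
    """Philosopher i increments its own entry and returns its request priority."""
    vsn[i][i] += 1
    return (i, list(vsn[i]))


def receive(j: int, request: Tuple[int, List[int]]) -> None:
    """Philosopher j merges the vector carried by an incoming request."""
    _, vec = request
    vsn[j] = merge(vsn[j], vec)


r0 = issue(0)    # (0, [1, 0, 0])
receive(1, r0)   # philosopher 1 learns of r0 ...
r1 = issue(1)    # ... before issuing (1, [1, 1, 0])
print(r0, r1, sum(r1[1]) > sum(r0[1]))
```

Here the request `r0` happens before `r1`, and `r1` ends up with the larger vector sum, that is, the lower priority.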

The rules to decide when and how to reply to the request are as follows: 

1. $p_j$ replies with an acknowledgment $Ack(j, vsn_i[i], vf_j)$ (where $vf_j$ is the cur-
rent value of $VF_j$) immediately if either (1) it is also interested in X but
is not currently acting as a captor in the meeting room, or (2) it is not
interested in X, is not in the meeting room, and has a priority lower than
$(i, vsn_i)$. (We assume that a philosopher $p_j$ possesses a priority all the time,
where the priority is set to a minimal value $(j, \infty)$ if $p_j$ is not interested in
a forum, and is set to $(j, vsn_j)$ if $p_j$ has requested a forum and the priority
of its request is $(j, vsn_j)$.)

2. $p_j$ replies with a message $Start((j, vsn_j), vsn_i[i])$ if it is in forum X and is
acting as a captor in the meeting room. Note that the reply bears $p_j$’s cur-
rent priority $(j, vsn_j)$.

3. Otherwise, $p_j$ must be interested in a different forum, and either $p_j$ is in
the meeting room, or it has a priority higher than $(i, vsn_i)$. In this case, $p_j$
delays the reply until it has exited the meeting room, and then replies with
an acknowledgment.
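
The three rules amount to a small decision function. The sketch below is a simplification under stated assumptions: a priority is reduced to the pair (identity, sum of the vector sequence number), with `math.inf` standing for the minimal priority $(j, \infty)$; the message contents are omitted; and the names `Philosopher` and `decide_reply` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Philosopher:
    pid: int
    state: str = "thinking"          # "thinking", "waiting" or "talking"
    target: Optional[str] = None     # forum of interest, None if not interested
    is_captor: bool = False          # acting as a captor in the meeting room
    priority_sum: float = math.inf   # vector sum of its request; inf = minimal


def lower(a: Tuple[int, float], b: Tuple[int, float]) -> bool:
    """a < b on (identity, vector sum) pairs."""
    (k, sv), (j, su) = a, b
    return sv > su or (sv == su and k > j)


def decide_reply(p: Philosopher, sender: int, sender_sum: int, forum: str) -> str:
    """Return "ack", "start" or "defer" for a request for `forum`."""
    if p.target == forum:
        # Same forum: a captor captures the requester, anyone else acknowledges.
        return "start" if p.is_captor else "ack"
    if p.state != "talking" and lower((p.pid, p.priority_sum), (sender, sender_sum)):
        return "ack"    # rule 1, case (2)
    return "defer"      # rule 3: reply after exiting the meeting room


captor = Philosopher(2, "talking", "X", True, 2)
print(decide_reply(captor, 1, 3, "X"))  # start
print(decide_reply(captor, 1, 3, "Y"))  # defer
```

The check on `is_captor` is what prevents a captured philosopher from capturing others in turn.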

Observe that an acknowledgment $Ack(j, vsn_i[i], vf_j)$ in the algorithm addi-
tionally carries two values: $vsn_i[i]$ ($p_i$’s sequence number in its request) and
$vf_j$ ($p_j$’s knowledge, at the time the acknowledgment is sent, of the counts of
entries to the meeting room made by each philosopher). The value $vsn_i[i]$ is used
by $p_i$ to determine whether the acknowledgment is for its current request, or
for a previous request (and in this case the acknowledgment is “out-of-date”
and must be discarded). A philosopher may receive an out-of-date acknowledg-
ment because it can enter the meeting room (as a captive) before it receives all
acknowledgments for its request. So, $p_i$’s request $Req((i, vsn_i), X)$ may have al-
ready ceased to exist when $p_i$ receives $Ack(j, vsn_i[i], vf_j)$. By comparing $vsn_i[i]$
with $p_i$’s sequence number in its current request, $p_i$ can decide whether the
acknowledgment is out-of-date.

The value of $vf_j$ is for $p_i$ to update its $VF_i$: on receipt of $Ack(j, vsn_i[i], vf_j)$, $p_i$
updates $VF_i$ to $merge(VF_i, vf_j)$. The new value of $VF_i$ then is used by $p_i$ to
check if every acknowledgment $Ack(k, vsn_i[i], vf_k)$ it possesses remains up-to-
date. In the algorithm, $VF_i[k]$ must be no greater than $vf_k[k]$ in order for
$Ack(k, vsn_i[i], vf_k)$ to remain up-to-date. The intuition for this is as follows:
$VF_k[k]$ must always have a correct count of the number of entries to the meet-
ing room $p_k$ has made. So, after $p_i$ receives $Ack(k, vsn_i[i], vf_k)$, $p_i$’s $VF_i[k]$
must be updated to $vf_k[k]$. Suppose $p_k$ is interested in a different forum. Let
$Req((k, vsn_k), Y)$ be $p_k$’s request. If $p_k$ does not enter the meeting room before $p_i$
does, then $p_k$’s acknowledgment remains up-to-date because $VF_i[k]$ continues to
have a value equal to $vf_k[k]$. The fact that $p_k$ acknowledges $p_i$’s request guaran-
tees that $p_k$’s request has a priority lower than that of $p_i$’s request. So, $p_k$ cannot
receive $p_i$’s acknowledgment until $p_i$’s request $Req((i, vsn_i), X)$ ceases to exist.
So, until $p_i$ has left forum X, $p_k$ cannot enter the meeting room as a captor.
However, if some philosopher $p_l$ is currently in forum Y acting as a captor,
then $p_l$ may reply to $p_k$’s request with a message $Start((l, vsn_l), vsn_k[k])$. So, $p_k$
may enter the meeting room before $p_i$. Note again that the fact that $p_l$ is in
the meeting room acting as a captor guarantees that $p_l$’s request must have a
priority higher than that of $p_i$’s request. So $p_i$ cannot enter the meeting room
as a captor until $p_l$ has replied to its request with an acknowledgment, which
cannot be sent until $p_l$ has exited the meeting room. When $p_l$ replies to $p_i$’s re-
quest with an acknowledgment $Ack(l, vsn_i[i], vf_l)$, for $p_i$ to learn from $p_l$ that $p_k$
has been captured to enter the meeting room (and so $p_k$’s acknowledgment
$Ack(k, vsn_i[i], vf_k)$ is out-of-date), the count $vf_l[k]$ is set to $vsn_k[k]$ (which must
be equal to $vf_k[k] + 1$). When $p_i$ receives $p_l$’s acknowledgment and updates its
$VF_i$, it knows that the current value of $VF_i[k]$ is one greater than $vf_k[k]$ car-
ried by $Ack(k, vsn_i[i], vf_k)$, and so it knows that the acknowledgment is out-of-date
and it must wait for a new acknowledgment from $p_k$ in order for it to enter the
meeting room as a captor.
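
The up-to-date test on acknowledgments can be sketched as follows. The class below is a hypothetical illustration of the bookkeeping at $p_i$ only (0-based identities, no networking): an acknowledgment from $p_k$ is kept while $VF_i[k]$ is no greater than the count $vf_k[k]$ it carried, and acknowledgments for an earlier request are discarded.

```python
from typing import Dict, List, Set


def merge(u: List[int], v: List[int]) -> List[int]:
    """Component-wise maximum of two vectors of equal length."""
    return [max(a, b) for a, b in zip(u, v)]


class AckTracker:
    """Tracks which acknowledgments for the current request are up-to-date."""

    def __init__(self, n: int) -> None:
        self.vf = [0] * n              # plays the role of VF_i
        self.current_seq = 0           # vsn_i[i] of the current request
        self.acks: Dict[int, int] = {}  # sender k -> vf_k[k] carried by its ack

    def new_request(self, seq: int) -> None:
        self.current_seq = seq
        self.acks.clear()

    def on_ack(self, sender: int, seq: int, vf_sender: List[int]) -> None:
        if seq != self.current_seq:
            return  # acknowledgment for a previous request: discard
        self.vf = merge(self.vf, vf_sender)
        self.acks[sender] = vf_sender[sender]
        # Keep only acknowledgments whose sender has not entered since.
        self.acks = {k: c for k, c in self.acks.items() if self.vf[k] <= c}

    def up_to_date(self) -> Set[int]:
        return set(self.acks)


t = AckTracker(3)                 # p_i = 0, p_k = 1, p_l = 2
t.new_request(seq=1)
t.on_ack(1, 1, [0, 0, 0])         # p_k acknowledges before being captured
t.on_ack(2, 1, [0, 1, 1])         # p_l reports that p_k has since entered
print(t.up_to_date())             # only p_l's acknowledgment remains
t.on_ack(1, 1, [0, 1, 1])         # p_k's new acknowledgment after exiting
print(t.up_to_date())
```

In this run the second acknowledgment raises the known entry count of $p_k$ above the value carried by its earlier acknowledgment, so that acknowledgment is dropped until $p_k$ sends a new one.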

The complete code of the algorithm is given in Fig. 2 as a CSP-like repetitive 
command consisting of five guarded commands A, B, C, D, and E [3]. 

The variables used in the algorithm are listed below: 

— state: the state of pi (see Fig. 3). The initial state is thinking. 

— VSNi: a vector of natural numbers of length N, where VSNi[j], initialized
to 0, represents a count of requests that are made by pj and that are known
at pi.

— VFi: a vector of natural numbers of length N, where VFi[j], initialized to 0,
represents a count of entries to the meeting room that are made by pj and
that are known at pi.

— priority: the priority of pi. It is initialized to a minimal value (i, ∞).

— target: the forum pi wishes to attend, or ⊥ otherwise. It is initialized to ⊥.

— is-captor: a boolean variable indicating if pi is acting as a captor in the
meeting room. It is initialized to false.

— request-queue: a queue of requests of size at most N − 1. It is initialized to
∅, and is used to store the most recent requests by every other philosopher.

— friend-requests: a queue of requests that are for the same forum as pi’s
request and that are received while pi is in the meeting room.

— ack-queue: the set of acknowledgments to pi’s request. It is initialized to ∅.
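
As a rough illustration of the variables listed above, the local state of pi could be held in a record like the one below. This is a minimal Python sketch, not the paper's code: the class name PhilosopherState, the field types, the use of None for "no forum", and the helper merge (assumed here to be a component-wise maximum of two count vectors) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from math import inf
from typing import List, Optional


def merge(a: List[int], b: List[int]) -> List[int]:
    """Component-wise maximum of two count vectors (assumed meaning of merge)."""
    return [max(x, y) for x, y in zip(a, b)]


@dataclass
class PhilosopherState:
    i: int                                   # index of this philosopher
    n: int                                   # number of philosophers N
    state: str = "thinking"                  # thinking, waiting, or talking
    target: Optional[str] = None             # None stands for "no forum"
    is_captor: bool = False
    vsn: List[int] = field(default_factory=list)          # VSN_i
    vf: List[int] = field(default_factory=list)           # VF_i
    priority: tuple = ()
    request_queue: dict = field(default_factory=dict)     # latest request per sender
    friend_requests: list = field(default_factory=list)
    ack_queue: dict = field(default_factory=dict)         # latest ack per sender

    def __post_init__(self):
        # Both count vectors start at zero; the priority starts at (i, infinity).
        self.vsn = [0] * self.n
        self.vf = [0] * self.n
        self.priority = (self.i, inf)
```

Keeping request_queue and ack_queue keyed by sender reflects the fact that only the most recent request or acknowledgment per philosopher needs to be stored.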

In the algorithm, request messages are processed by guarded command B.
Note that after pi has replied to Req((j, vsnj), Y) with an acknowledgment
Ack(i, vsnj[j], VFi), pi may need to send a new reply to the request. This
happens when (1) pi subsequently enters the meeting room as a captor and wishes
to capture pj to join the forum, or (2) pi is captured to attend the meeting room
and needs to send a new acknowledgment to pj after it has exited the meeting
room. Therefore, a request message may need to be saved for another reply. In
the algorithm, request messages are kept in request-queue. Obviously, only the
most recent request per philosopher needs to be saved. Lines B.7-13 implement
the three rules described earlier regarding how request messages are processed.
Each new request is kept in request-queue (line B.6) no matter how pi responds
A.1   *[ wish to attend a forum of X →
A.2        VSNi[i] := VSNi[i] + 1;
A.3        target := X;
A.4        priority := (i, VSNi); /* obtain a priority for its request */
A.5        state := waiting;
A.6        multicast Req((i, VSNi), X) to every other philosopher; /* issue a request */

B.1   □ receive Req((j, vsnj), Y) →
B.2        VSNi := merge(VSNi, vsnj); /* adjust VSNi */
B.3        if ∃ Req((j, vsn'j), Z) ∈ request-queue then /* remove pj’s previous request */
B.4            remove Req((j, vsn'j), Z) from request-queue;
B.5        if VFi[j] < vsnj[j] then {
B.6            add Req((j, vsnj), Y) to request-queue;
B.7            if (target = Y ∧ ¬is-captor) ∨ (target ≠ Y ∧ state ≠ talking ∧
                   priority < (j, vsnj)) then
B.8                send Ack(i, vsnj[j], VFi) to pj;
B.9            else if target = Y ∧ is-captor then {
B.10               send Start(priority, vsnj[j]) to pj; /* pi captures pj */
B.11               /* pi is sure that pj will make its vsnj[j]th entry to the meeting room */
B.12               VFi[j] := max(VFi[j], vsnj[j]); }
B.13           /* otherwise, target ≠ Y ∧ (state = talking ∨ priority > (j, vsnj)) */
B.14           if target = Y ∧ state = talking then
B.15               /* pi need not re-send a reply to the request on exiting the
                      meeting room */
B.16               add Req((j, vsnj), Y) to friend-requests;
B.17       } /* else the request is out-of-date */

C.1   □ receive Ack(j, sn, vfj) →
C.2        VFi := merge(VFi, vfj); /* adjust VFi */
C.3        let priority be (i, vsni);
C.4        if vsni[i] = sn ∧ state ≠ talking then {
               /* otherwise the acknowledgment is out-of-date */
C.5            if ∃ Ack(j, sn', vf'j) ∈ ack-queue then /* remove pj’s previous reply */
C.6                remove Ack(j, sn', vf'j) from ack-queue;
C.7            add Ack(j, sn, vfj) to ack-queue;
C.8            for each Ack(k, sn'', vf''k) ∈ ack-queue do
                   /* remove out-of-date acknowledgments */
C.9                if vf''k[k] < VFi[k] then remove Ack(k, sn'', vf''k) from ack-queue;
C.10           if |ack-queue| = N − 1 then {
C.11               /* receive acknowledgments from every other philosopher */
C.12               state := talking; /* enter the meeting room as a captor */
C.13               VFi[i] := VFi[i] + 1;
C.14               is-captor := true;
C.15               ack-queue := ∅;
C.16               for each Req((l, vsnl), Y) ∈ request-queue do
C.17                   if target = Y ∧ VFi[l] < vsnl[l] then {
C.18                       /* capture the congenial philosopher */
C.19                       send Start(priority, vsnl[l]) to pl;
C.20                       VFi[l] := max(VFi[l], vsnl[l]); }
C.21                   else if (l, vsnl) > priority then /* the request is out-of-date */
C.22                       remove Req((l, vsnl), Y) from request-queue;
C.23           } /* end of if-then statement in line C.10 */
C.24       } /* end of if-then statement in line C.4 */

D.1   □ receive Start((j, vsnj), sn) →
D.2        let priority be (i, vsni);
D.3        if vsni[i] = sn ∧ state ≠ talking then { /* captured by pj */
D.4            state := talking; /* enter the meeting room as a captive */
D.5            VFi[i] := VFi[i] + 1;
D.6            ack-queue := ∅;
D.7            for each Req((l, vsnl), Y) ∈ request-queue do
D.8                /* remove the captor’s request and the requests that bear a priority
                      higher than that of the captor */
D.9                if (l, vsnl) ≥ (j, vsnj) then remove Req((l, vsnl), Y) from request-queue;
D.10       /* else, out-of-date message */ }

E.1   □ exit a forum of target →
E.2        for each Req((j, vsnj), Y) ∈ request-queue do
E.3            if VFi[j] < vsnj[j] ∧ Req((j, vsnj), Y) ∉ friend-requests then
E.4                send Ack(i, vsnj[j], VFi) to pj;
E.5            else if VFi[j] ≥ vsnj[j] then /* the request is out-of-date */
E.6                remove Req((j, vsnj), Y) from request-queue;
E.7        target := ⊥;
E.8        friend-requests := ∅;
E.9        priority := (i, ∞);
E.10       is-captor := false;
E.11       state := thinking;
E.12  ]

Fig. 2. Algorithm RA2 executed by philosopher pi.

to the request (unless the request is out-of-date; see the comment at the end of 
this section). 
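
The decision made in guarded command B can be condensed into a single function. The following Python sketch is only an illustration under stated assumptions: the function name, the string results, and the boolean parameter mine_is_lower (which stands in for the priority comparison, whose definition is not reproduced in this section) are hypothetical.

```python
def reply_to_request(target, is_captor, state, mine_is_lower, forum, vf_ij, vsn_jj):
    """Return the action p_i takes on receiving Req((j, vsn_j), forum).

    vf_ij is VF_i[j], vsn_jj is vsn_j[j], and mine_is_lower says whether
    p_i's priority is lower than the priority of the incoming request.
    """
    if vf_ij >= vsn_jj:
        return "out-of-date"      # the request has already been granted
    if (target == forum and not is_captor) or (
        target != forum and state != "talking" and mine_is_lower
    ):
        return "ack"              # acknowledge immediately
    if target == forum and is_captor:
        return "start"            # capture p_j into the current forum
    return "defer"                # keep the request and reply on exit
```

In every case except "out-of-date" the request is also stored in request-queue, since a later reply may still be required.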

Acknowledgments are processed by guarded command C. pi enters the meeting
room as a captor when its request has been acknowledged by all the other
philosophers (line C.10). Then, for each request for X in request-queue, pi must
send a start message to capture the requesting philosopher to enter the meeting
room (lines C.17-20). Moreover, the requests in request-queue bearing a priority
higher than that of pi’s request can also be deleted (lines C.21-22) because the
requesting philosophers must have already entered the meeting room for their
requests (for otherwise they would not have acknowledged pi’s request). If pi
instead enters the meeting room as a captive (line D.3), then it removes the
captor’s request and the requests in request-queue that bear a priority higher
than that of the captor’s request (lines D.7-9). Again, this is because the
requests must have already ceased to exist; for, otherwise, pi’s captor would
not be able to enter the meeting room to capture pi. The captor’s priority
carried by each start message is used for this purpose.
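
The entry test applied to acknowledgments (lines C.8-10) can be sketched as follows. This is an illustrative Python fragment, not the paper's code; the function name and the dictionary representation of ack-queue are assumptions.

```python
def ready_to_enter(ack_queue, vf, n):
    """Drop stale acknowledgments, then test the entry condition of line C.10.

    ack_queue maps a sender k to the vector vf_k carried by its acknowledgment,
    vf is VF_i after merging, and n is the number of philosophers N.
    """
    # An acknowledgment from p_k is stale once VF_i[k] exceeds the count it carries.
    fresh = {k: vf_k for k, vf_k in ack_queue.items() if vf_k[k] >= vf[k]}
    return fresh, len(fresh) == n - 1
```

A philosopher enters as a captor only when the second component of the result is true, that is, when a current acknowledgment from every other philosopher is held.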

On exiting the meeting room, pi must reply to the requests that are held
in request-queue, including those to which pi may have already replied (pi’s
replies to them become out-of-date because pi has been captured to enter the
meeting room). Clearly, not every request needs a reply. For example, if the
value of vsnj[j] in a request Req((j, vsnj), Y) is less than or equal to VFi[j],
then, obviously, the request has been granted and so no reply is needed.
Moreover, if a request Req((j, vsnj), Y) arrives while pi is in forum Y, then
even though pi does not act as a captor in the meeting room, pi must still
acknowledge the request immediately so as to guarantee concurrent entering.
Clearly, the acknowledgment must carry the most up-to-date count of entries pi
has made to the meeting room (including the one pi has just made to enter the
forum Y). A new acknowledgment sent by pi upon leaving the forum would therefore
carry no additional information and so can be omitted. The queue friend-requests
is used to store such requests (lines B.14-16 and E.3). The rule for determining
which requests need a reply and which need to be removed is given in lines E.2-6.
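
The exit rule of lines E.2-6 can be sketched in a few lines. The Python below is a hedged illustration only: the function name and the dictionary and set representations of request-queue and friend-requests are assumptions.

```python
def on_exit(request_queue, friend_requests, vf):
    """Apply lines E.2-6 and return (senders to acknowledge, pruned queue).

    request_queue maps a sender j to vsn_j[j] of its latest request,
    friend_requests is the set of senders already answered while inside,
    and vf is VF_i.
    """
    to_ack = []
    remaining = {}
    for j, vsn_jj in request_queue.items():
        if vf[j] >= vsn_jj:
            continue                  # out-of-date: drop the request
        remaining[j] = vsn_jj         # still pending, so keep it
        if j not in friend_requests:
            to_ack.append(j)          # needs a (new) acknowledgment
    return to_ack, remaining
```

Requests answered while pi was in the forum stay in the queue but are not acknowledged again, which is the saving that friend-requests provides.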

Note that after pi has replied to a request (say, by pj) of the same interest
(say, forum Y), pi may still need to keep the request in request-queue because
there is no guarantee that pj’s request will be granted before or concurrently
with pi, regardless of whose priority is higher. For example, suppose pi is
captured by pk to attend Y. Suppose further that pj’s request arrives at pk
after pk has left Y (so that pk is unable to capture pj in time), and pj’s
request arrives at pi before pk has captured pi. Then, when pi acknowledges
pj’s request, pj’s knowledge of VFi[i] is one less than the count of which pk
will inform pj via pk’s acknowledgment. So, pj will consider pi’s acknowledgment
out-of-date and will wait for a new acknowledgment from pi. So, pi must send
the acknowledgment for, otherwise, the system would be deadlocked. Moreover,
after pi has sent the new acknowledgment, pi may request another forum and then
be captured again to enter the meeting room before pj is allowed to enter the
meeting room. So, pi’s second acknowledgment may again become out-of-date. So,
in the worst case, pi has to maintain pj’s request until it has learned that
pj’s request has ceased to exist (or is about to cease to exist).

The fact that a philosopher may be captured over and over again also reveals 
that the algorithm provides a virtually unbounded degree of concurrency. Note, 
however, that this does not mean that a captor can capture a philosopher an 
infinite number of times; for otherwise, it would contradict our assumption that 
every philosopher spends only finite time in the meeting room. 

Finally, because a philosopher pk sets its VFk[j] to vsnj[j] upon capturing pj
with a request Req((j, vsnj), X), at some point VFk[j] may be greater than
VFj[j] (whose value must be equal to vsnj[j] − 1 when pj makes the request
Req((j, vsnj), X)). Moreover, another philosopher pi may learn of the new VFk[j]
before pj has received pk’s start message (and so before pj has made its
vsnj[j]th entry to the meeting room). As a result, care must be taken to prevent
pj from miscounting the number of entries it has made to the meeting room;
otherwise, the algorithm could err. In the algorithm, a request Req((j, vsnj), X)
arriving at pi is considered out-of-date and is removed right away if
VFi[j] ≥ vsnj[j] (line B.5). Therefore, pj will never receive an acknowledgment
Ack(i, vsnj[j], vfi) from pi whose vfi[j] is greater than VFj[j].

Due to the space limitation, the correctness of RA2 and its performance will
be analyzed in the full paper.

5 Conclusions and Future Work 

We have presented an efficient and highly concurrent distributed algorithm RA2
for the Congenial Talking Philosophers problem. The algorithm requires, on
average, N to 3(N − 1) messages per entry to the meeting room. In terms
of context-switch complexity, when a philosopher requests a forum X, at most
2(N − 1) rounds of passages can be initiated before a round of X is initiated
in which the philosopher can make a passage through X. Within each round, at
most O(N) passages suffice to account for the blocking of another philosopher’s
request. So the time complexity is O(N²). In terms of concurrency, RA2 can
offer a virtually unbounded degree of concurrency.

For comparison, we have also presented RA1, which is a straightforward
modification of Ricart and Agrawala’s algorithm for n-process mutual exclusion.
RA1 requires 2(N − 1) messages per entry to the meeting room. Its context-switch
complexity and time complexity are both 2(N − 1), and it offers a maximum
degree of concurrency of 2(N − 2). Although, statically, the two algorithms do
not differ very much from various perspectives, the difference between their
dynamic performance is quite significant. As illustrated by the simulation, RA1
performs poorly as compared to RA2; it out-performs RA2 only in message
complexity, and only by a slight margin.

The comparison between RA1 and RA2 also highlights several directions for
future work. For example, RA2 requires, in the worst case, 3(N − 1) messages
per entry to the meeting room (although in simulation the number of messages
required is approximately 2.3(N − 1)), but RA1 needs only 2(N − 1) messages.
Moreover, in RA2 each request bears a vector timestamp, while only a scalar
timestamp is needed in RA1. The use of vector timestamps compromises the
scalability of RA2. Thus, it is interesting to see if RA2 can be further
improved to reduce its message size and the number of messages needed per entry
to the meeting room.
Appendix A: Some simulation results

The Appendix presents some simulation results for both RA1 and RA2. In the
simulation, we have set up a system of N philosophers and m fora. Each time
a philosopher wishes to attend a forum, it randomly chooses one of the m fora
to attend, and the choice follows a uniform distribution. The time a philosopher
stays in states thinking and talking follows an exponential distribution
with means μ_thinking and μ_talking, respectively. The message transmission time
also follows an exponential distribution with a mean μ_link_delay. We measure the
following values:

— The average time a philosopher spends in state waiting; i.e., the average 
waiting time. 

— The average number of rounds of passages a philosopher has bypassed in 
state waiting; i.e., the average number of context switches a philosopher has 
to wait per request. 

— The average number of passages per round; i.e., the average round size. 

— The average capacity; i.e., on an average, the maximum number of philoso- 
phers that can be in the meeting room simultaneously per round. 

— The average number of messages required per entry to the meeting room. 

— Throughput; i.e., the average number of entries to the meeting room per 
second. 

The simulation program is written in Java using Java Development Kit V1.02.
Fig. 4 summarizes some of our simulation results.¹ For comparison, we have also
measured the behavior of the algorithm when philosophers use the meeting room
in a mutually exclusive style. This is done by designating one unique forum to
each philosopher. In the figure, we use m = N* to denote this scenario. We have
also set up the case m = 1 where maximum concurrency should be allowed as
no two requests will conflict. It is not difficult to see that the setting of Fig. 4
represents a very high contention situation for the meeting room.

¹ Some Java applets animating the algorithms for the Congenial Talking Philosophers
problem can be found at http://joung.im.ntu.edu.tw/congenial/.

Fig. 3. State transition diagram of a philosopher.

Fig. 4. Some simulation results.

From Fig. 4 we can see that RA1 provides virtually no concurrency. For
example, when m = 3, it is likely that one third of the waiting requests are
targeting the same forum very often. However, the simulation results indicate
that the behavior of the system is only slightly better than the case m = 30*
where the philosophers use the meeting room in a mutually exclusive style. On
the other hand, the performance is improved significantly by RA2. For example,
when m = 3, the average number of context switches per request is 1.2, as
opposed to 19.1 for Algorithm RA1. Moreover, on average, up to 16.27
philosophers can be in the meeting room simultaneously per round for RA2, as
opposed to 1.51 for RA1. Note that the number of messages required per entry to
the meeting room is 2(N − 1) = 58 for RA1, but for RA2 this number is not fixed
for different values of m. As we can see, the message complexity is 2(N − 1)
when no two philosophers may request the same forum simultaneously; otherwise,
the message complexity ranges between N and 3(N − 1). The message complexity
increases as m decreases. This is also witnessed by the simulation results. For
example, when m = 20, the average number of messages required per entry to the
meeting room is 59.41 = 2.05(N − 1). The number increases to 66.83 = 2.30(N − 1)
for m = 3, and remains about the same value for the worst case where m = 1.
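
The figures quoted above can be checked with a few lines of arithmetic. The Python sketch below assumes N = 30, which is inferred from the statement that 2(N − 1) = 58; the variable names are illustrative.

```python
N = 30                      # inferred from 2(N - 1) = 58
per_peer = N - 1

# Express the measured message counts as multiples of (N - 1).
for m, messages in [(20, 59.41), (3, 66.83)]:
    ratio = messages / per_peer
    print(f"m = {m}: {messages} messages = {ratio:.2f}(N - 1)")

# Stated range of the message complexity of RA2: between N and 3(N - 1).
lower, upper = N, 3 * per_peer
```

Both measured averages fall inside the stated range of N to 3(N − 1) messages per entry.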
Abstract. Concurrent programs often encounter failures, such as races,
owing to the presence of synchronization faults (bugs). One existing
technique to tolerate synchronization faults is to roll back the program to a
previous state and re-execute, in the hope that the failure does not recur.
Instead of relying on chance, our approach is to control the re-execution
in order to avoid a recurrence of the synchronization failure. The control
is achieved by tracing information during an execution and using this
information to add synchronizations during the re-execution.

The approach gives rise to a general problem, called the off-line predicate
control problem, which takes a computation and a property specified on
the computation, and outputs a “controlled” computation that maintains
the property. We solve the predicate control problem for the mutual
exclusion property, which is especially important in synchronization fault
tolerance.



1 Introduction 

Concurrent programs are difficult to write. The programmer is presented with 
the task of balancing two competing forces: safety and liveness [8]. Frequently, 
the programmer leans too much in one of the two directions, causing either safety 
failures (e.g. races) or liveness failures (e.g. deadlocks). Such failures arise from a 
particular kind of software fault (bug), known as a synchronization fault. Studies
have shown that synchronization faults account for a sizeable fraction of observed
software faults in concurrent programs [6]. Locating synchronization faults and
eliminating them by reprogramming is always the best strategy. However, many
systems must maintain availability in spite of software failures. It is, therefore,
desirable to be able to bypass a synchronization fault and recover from the
resulting failure. This problem of software fault tolerance for synchronization
faults in concurrent programs¹ is the primary motivation for this paper.

Traditionally, it was believed that software failures are permanent in nature 
and, therefore, they would recur in every execution of the program with the 
same inputs. This belief led to the use of design diversity to recover from soft- 
ware failures. In approaches based on design diversity [1,13], redundant modules 
with different designs are used, ensuring that there is no single point-of-failure. 
Contrary to this belief, it was observed that many software failures are, in fact, 
transient - they may not recur when the program is re-executed with the same 
inputs [3]. In particular, the failures caused by synchronization faults are usually 
transient in nature. 

The existence of transient software failures motivated a new approach to 
software fault tolerance based on rolling back the processes to a previous state 
and then restarting them (possibly with message reordering), in the hope that 
the transient failure will not recur in the new execution [5,17]. Methods based on 
this approach have relied on chance in order to recover from a transient software 
failure. In the special case of synchronization faults, however, it is possible to 
do better. Instead of leaving recovery to chance, our approach ensures that the 
transient synchronization failure does not recur. It does so by controlling the 
re-execution, based on information traced during the failed execution. 

Our mechanism involves (i) tracing an execution, (ii) detecting a
synchronization failure, (iii) determining a control strategy, and (iv)
re-executing under control. Each of these problems is also of independent
interest. Our requirements for tracing an execution and re-execution under
control are very similar to trace-and-replay techniques in concurrent debugging.
Trace-and-replay techniques have been studied in various concurrent paradigms
such as message-passing parallel programs [12], shared-memory parallel programs
[11], distributed shared memory programs [15], and multi-threaded programs [2].
Among synchronization failures, this paper will focus on races. Race detection
has been previously studied [4,10]. We will discuss tracing, failure detection,
and re-execution under control in greater depth in Section 4. This paper
addresses the remaining problem of determining a control strategy.

To illustrate what determining a control strategy involves, consider the
execution shown in Figure 1(a). CS1 - CS4 are the critical sections of the
execution. The synchronizations between processes are shown as arrows from one
process execution to another. A synchronization ensures that the execution after
the head of the arrow can proceed only after the execution before the tail has
completed. A race occurs when two critical sections execute simultaneously. For
example, CS1 and CS2 may have a race, since the synchronizations do not prevent
them from executing simultaneously. A control strategy is a set of added
synchronizations that would ensure that a race does not occur. The race can be
avoided by adding synchronizations, shown as broken arrows in Figure 1(b).

[Figure: timelines of processes P1, P2, and P3 with critical sections CS1 - CS4.
(a) Trace. (b) Trace with added synchronizations.]

Fig. 1. Example: Tracing and Controlling During Rollback Recovery

¹ By concurrent programs, we include all parallel programming paradigms such as:
multi-threaded programs, shared-memory parallel programs, message-passing
distributed programs, distributed shared-memory programs, etc. We will refer to a
parallel entity as a process, although in practice it may also be a thread.

The focus of this paper is the problem of determining which synchroniza- 
tions to add to an execution trace in order to tolerate a synchronization fault in 
the re-execution. This proves to be an important problem in its own right and 
can be applied in areas other than software fault tolerance, such as concurrent 
debugging. We generalize the problem using a framework known as the off-line 
predicate control problem. This problem was introduced in [16], where it was 
applied to concurrent debugging. Informally, off-line predicate control specifies 
that, given a computation and a property on the computation, one must deter- 
mine a controlled computation (one with more synchronizations) that maintains 
the property. (We will use the term computation for a formal model of an
execution.) The previous work [16] solved the predicate control problem for a
class of properties called disjunctive predicates. Applying the results of that
study to software fault tolerance would mean avoiding synchronization failures
of the form: $l_1 \wedge l_2 \wedge l_3$, where $l_i$ is a local property
specified on process $P_i$. For example, if $l_i$ specifies that a server is
unavailable, the synchronization failure is that all servers are unavailable at
the same time.

In this paper, we address a class of off-line predicate control problems,
characterized by the mutual exclusion property, that is especially useful in
tolerating races. We consider four classes of mutual exclusion properties:
off-line mutual exclusion, off-line readers writers, off-line independent mutual
exclusion, and off-line independent read-write mutual exclusion. For each of
these classes of properties, we determine necessary and sufficient conditions
under which the problem may be solved. Furthermore, we design an efficient
algorithm that solves the most general of the problems, off-line independent
read-write mutual exclusion, and thus also solves each of the other three
problems. The algorithm takes O(np) time, where n is the number of concurrent
processes and p is the number of critical sections.

The problems have been termed off-line problems to distinguish them from
their more popular on-line variants (i.e. the usual mutual exclusion
problems [14]). The difference between the on-line and off-line problems is that
in the on-line case, the computation is provided on-line, whereas in the off-line
case, the computation is known a priori. Ignorance of the future makes on-line
mutual exclusion a harder problem to solve. In general, in on-line mutual
exclusion, one cannot avoid deadlocks without making some assumptions (e.g.
critical sections do not block). Thus, without such assumptions, on-line mutual
exclusion is impossible to solve. To understand why this is true, consider the
scenario in Figure 1. Any on-line algorithm, being unaware of the future
computation, would have a symmetric choice of entering CS1 or CS2 first. If CS2
is entered first, it would result in a deadlock. An off-line algorithm, being
aware of the future computation, could make the correct decision to enter CS1
first and add a synchronization from CS1 to CS2. A proof of the impossibility of
on-line mutual exclusion follows along similar lines as the proof of Theorem 3
in [16]. Thus, there will always be scenarios where on-line mutual exclusion
algorithms will fail, resulting in either race conditions or deadlocks. In such
scenarios, controlled re-execution based on off-line mutual exclusion becomes
vitally important.

2 Model and Problem Statement 

The model that we present is of a single execution of the concurrent program. 
The model is not at the programming language level, but at a lower level, at 
which the execution consists of a sequence of states for each process and the 
communications that occurred among them (similar to the happened-before
model [7]).

Let $S$ be a finite set of elementary entities known as states. $S$ is
partitioned into subsets $S_1, S_2, \ldots, S_n$, where $n > 1$. These partitions
correspond to $n$ processes in the system. A subset $G$ of $S$ is called a global
state iff $\forall i : |G \cap S_i| = 1$. Let $G_i$ denote the unique element in
$G \cap S_i$. A global predicate is a function that maps a global state onto a
boolean value.

A computation is a partial order $\rightarrow$ on $S$ such that
$\forall i : \rightarrow_i$ is a total order on $S_i$, where $\rightarrow_i$
represents $\rightarrow$ restricted to the set $S_i$. Note that the states in a
single process are totally ordered while the states across processes are
partially ordered. We will use decorated arrows such as $\rightarrow$ and
$\rightarrow^c$ to denote computations, and $\parallel$, $\parallel^c$, and so on
to denote the respective incomparability relations (e.g.
$s \parallel t \equiv \neg(s \rightarrow t) \wedge \neg(t \rightarrow s)$). Given
a computation $\rightarrow$ and a subset $K$ of $S$,
$\rightarrow$-consistent$(K) \equiv \forall s, t \in K : s \parallel t$. In
particular, a global state may be $\rightarrow$-consistent. The notion of
consistency tells us when a set of states could have occurred concurrently in a
computation.
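
To make the notion of consistency concrete, a computation can be represented as a set of ordered pairs of states. The Python sketch below is an illustration under that assumed representation; the function names and the example states a1, a2, b1, b2 are hypothetical.

```python
from itertools import combinations


def transitive_closure(pairs):
    """Smallest transitive relation containing the given (s, t) pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def consistent(order, states):
    """True iff every two distinct states in `states` are incomparable."""
    return all(
        (s, t) not in order and (t, s) not in order
        for s, t in combinations(states, 2)
    )


# Two processes with states a1 -> a2 and b1 -> b2, plus one message a1 -> b2.
order = transitive_closure({("a1", "a2"), ("b1", "b2"), ("a1", "b2")})
```

In this example {a1, b1} and {a2, b2} are consistent, whereas {a1, b2} is not, because a1 is ordered before b2.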

A computation $\rightarrow$ is extensible in $S$ iff:

$\forall K \subseteq S : \rightarrow$-consistent$(K) \Rightarrow \exists$ global state $G \supseteq K : \rightarrow$-consistent$(G)$

Intuitively, extensibility allows us to extend a consistent set of states to a con- 
sistent global state. Any computation in S can be made extensible by adding 
“dummy” states to S. Therefore, we implicitly assume that any computation is 
extensible. 

Given a computation $\rightarrow$, let $\le$ be a relation on global states
defined as: $G \le H \equiv \forall i : (G_i \rightarrow_i H_i) \vee (G_i = H_i)$.
It is a well-known fact that the set of $\rightarrow$-consistent global states is
a lattice with respect to the $\le$ relation [9]. In particular, we will use
$\rightarrow$-glb$(G, H)$ for the greatest lower bound of $G$ and $H$ with
respect to $\rightarrow$ (so, $\rightarrow$-glb$(G, H)_i = \rightarrow_i$-min$(G_i, H_i)$).
If $G$ and $H$ are $\rightarrow$-consistent, then $\rightarrow$-glb$(G, H)$ is
also $\rightarrow$-consistent.
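
The greatest lower bound is computed one process at a time. The following Python sketch assumes that a global state is a list with one state per process and that a hypothetical table position gives each state's index within its own process.

```python
def glb(g, h, position):
    """Greatest lower bound of two global states, taken per process.

    g and h are lists with one state per process; position[s] is the
    index of state s in its own process's sequence (its total order).
    """
    return [gi if position[gi] <= position[hi] else hi for gi, hi in zip(g, h)]


position = {"a1": 0, "a2": 1, "b1": 0, "b2": 1}
```

For instance, the greatest lower bound of (a2, b1) and (a1, b2) is (a1, b1).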
Given a computation $\rightarrow$ and a global predicate $B$, a computation
$\rightarrow^c$ is called a controlling computation of $B$ in $\rightarrow$ iff
(1) $\rightarrow \subseteq \rightarrow^c$, and (2)
$\forall G : \rightarrow^c$-consistent$(G) \Rightarrow B(G)$. This tells us that
a controlling computation is a stricter partial order (containing more
synchronizations). Further, any global state that may occur in the controlling
computation must satisfy the specified global predicate. Thus, the problem of
finding a controlling computation is to add synchronizations until all global
states that violate the global predicate are made inconsistent. More formally:

The Off-line Predicate Control Problem: Given a computation $\rightarrow$ and a
global predicate $B$, find a controlling computation of $B$ in $\rightarrow$.

3 Solving the Off-Line Predicate Control Problem 

In [16], it was proved that the Off-line Predicate Control Problem is NP-hard. Therefore,
it is important to solve useful restricted forms of the Off-line Predicate Control 
Problem. Since we are interested in avoiding race conditions, we restrict the 
general problem by letting B specify the mutual exclusion property. 



[Figure: hierarchy of the four problems. Off-line Independent Read-Write Mutual
Exclusion generalizes both Off-line Readers Writers and Off-line Independent
Mutual Exclusion, each of which generalizes Off-line Mutual Exclusion.]

Fig. 2. Variants of Off-line Mutual Exclusion



The simplest specification for mutual exclusion is: no two critical sections
execute at the same time. This corresponds to the semantics of a single exclusive
lock for all the critical sections. We call the corresponding problem the
Off-line Mutual Exclusion Problem. We can generalize this to the Off-line Readers
Writers Problem by specifying that only critical sections that “write” must be
exclusive, while critical sections that “read” need not be exclusive. This
corresponds to the semantics of read-exclusive-write locks. Another way to
generalize the Off-line Mutual Exclusion Problem is to allow the semantics of
independent locks. In this Off-line Independent Mutual Exclusion Problem, no two
critical sections of the same lock can execute simultaneously. Finally, we can
have critical sections with the semantics of independent read-exclusive-write
locks. This is the Off-line Independent Read-Write Mutual Exclusion Problem.
Figure 2 illustrates the relative generality of the four problems.

In traditional on-line mutual exclusion, there has been no “independent” 
variant, since it trivially involves applying the same algorithm for each lock. 
However, in off-line mutual exclusion, such an approach will not work, since 
the synchronizations added by each independent algorithm may cause deadlocks 
when applied together. 



Software Fault Tolerance of Concurrent Programs 215 



For the practitioner, an algorithm which solves Off-line Independent Read-Write
Mutual Exclusion would suffice, since it can be used to solve all other
variants. However, for the purpose of presentation we will start with the simplest
Off-line Mutual Exclusion Problem and then generalize it in steps. For each
problem, we will determine the necessary and sufficient conditions for finding a
solution. Finally, we will make use of the results in the design of an algorithm
which solves the most general of the four problems.



3.1 Off-Line Mutual Exclusion 

Off-line Mutual Exclusion is a specialization of Off-line Predicate Control to the 
following class of global predicates: 

$B_{mutex}(G) \equiv \forall$ distinct $s, t \in G : \neg(critical(s) \wedge critical(t))$

where critical is a function that maps a state onto a boolean value. Thus, $B_{mutex}$
specifies that at most one process may be critical in a global state.
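
The predicate translates directly into code. In this minimal Python sketch, the function name b_mutex and the representation of a global state as a collection of states are assumptions made for illustration.

```python
def b_mutex(global_state, critical):
    """True iff at most one state in the global state is critical."""
    return sum(1 for s in global_state if critical(s)) <= 1
```

Counting critical states is equivalent to checking every pair of distinct states, because a violating pair exists exactly when two or more states are critical.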

Based on the critical boolean function on states, we define critical sections
as maximal intervals of critical states. More precisely: given a critical function
on $S$ and a computation $\rightarrow$ on $S$, a critical section, $CS$, is a
non-empty, maximal subset of an $S_i$ such that: (1)
$\forall s \in CS : critical(s)$, and (2)
$\forall s, t \in CS : \forall u \in S_i : s \rightarrow_i u \rightarrow_i t \Rightarrow u \in CS$.

Let $CS.first$ and $CS.last$ be the minimum and maximum states respectively
in $CS$ (w.r.t. $\rightarrow_i$). Let $\mapsto$ be a relation on critical sections
defined as: $CS \mapsto CS' \equiv CS.first \rightarrow CS'.last \wedge CS \neq CS'$.
Thus, $\mapsto$ orders a critical section before another if some state in the
first happened before some state in the second. Note that $\mapsto$ may have
cycles.
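
The two definitions above can be sketched as follows. This Python fragment is illustrative only: it assumes each process is given as a list of states in its total order and that the computation is a transitively closed set of ordered pairs; the function names are hypothetical.

```python
def critical_sections(process_states, critical):
    """Split one process's state sequence into maximal runs of critical states."""
    sections, run = [], []
    for s in process_states:
        if critical(s):
            run.append(s)
        elif run:
            sections.append(run)
            run = []
    if run:
        sections.append(run)
    return sections


def precedes(cs, other, order):
    """The relation CS |-> CS': CS.first -> CS'.last and CS != CS'."""
    return cs != other and (cs[0], other[-1]) in order
```

Each critical section is a list of states, so its first and last elements play the roles of CS.first and CS.last.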

We will be dealing with different computations. All computations will have
the same total order $\rightarrow_i$ for each $S_i$. Therefore, the set of
critical sections will not change for each computation. However, the $\mapsto$
relation will change, in general. For a decorated computation such as
$\rightarrow^c$, the relation on critical sections will be denoted with the same
decoration, as $\mapsto^c$.

Theorem 1 (Necessary Condition) For a computation $\rightarrow$ of $S$, and a
global predicate $B_{mutex}$,

a controlling computation of $B_{mutex}$ in $\rightarrow$ exists $\Rightarrow$ $\mapsto$ has no cycles



Proof: We prove the contrapositive. Let $\mapsto$ have a cycle, say
$CS_1 \mapsto CS_2 \mapsto \cdots \mapsto CS_m \mapsto CS_1$ $(m \ge 2)$, and let
$\rightarrow^c$ be a computation such that $\rightarrow \subseteq \rightarrow^c$.
Since $\rightarrow^c$ cannot have a cycle, at least one of:
$CS_1.last \not\rightarrow^c CS_2.first$, $CS_2.last \not\rightarrow^c CS_3.first$,
$\ldots$, and $CS_m.last \not\rightarrow^c CS_1.first$
must hold. Without loss of generality, let $CS_1.last \not\rightarrow^c CS_2.first$.
We also have $CS_2.last \not\rightarrow^c CS_1.first$ (since $CS_1 \mapsto CS_2$).
Since $\rightarrow^c$ is extensible, we can define $s_2$ as the maximum state in
$S_2$ such that $CS_1.last \parallel^c s_2$, and $s_1$ as the maximum state in
$S_1$ such that $CS_2.last \parallel^c s_1$. By extensibility of $\rightarrow^c$,
we can find $\rightarrow^c$-consistent global states $G_1$ and $G_2$ containing
$\{CS_1.last, s_2\}$ and $\{CS_2.last, s_1\}$ respectively. We now have two cases:

Case 1: [$s_1 \in CS_1 \vee s_2 \in CS_2$] In this case,
$\neg B_{mutex}(G_2) \vee \neg B_{mutex}(G_1)$.

Case 2: [$s_1 \notin CS_1 \wedge s_2 \notin CS_2$] Since $s_1 \notin CS_1$, there
are two ways to position $s_1$: (a) $s_1 \rightarrow_1 CS_1.first$ or (b)
$CS_1.last \rightarrow_1 s_1$. In sub-case (a), since
$CS_2.last \not\rightarrow^c CS_1.first$, either
$CS_2.last \parallel^c CS_1.first$ or $CS_1.first \rightarrow^c CS_2.last$,
which gives us $s_1 \rightarrow^c CS_2.last$. Both possibilities contradict the
definition of $s_1$. This leaves sub-case (b) as the only possibility. Therefore,
$CS_1.last \rightarrow_1 s_1$. Similarly, we can prove
$CS_2.last \rightarrow_2 s_2$. Let $H = \rightarrow^c$-glb$(G_1, G_2)$. $H$
contains $CS_1.last$ and $CS_2.last$ and, so, $\neg B_{mutex}(H)$. Further, $H$
is $\rightarrow^c$-consistent (by the lattice property).

So in either case, $\rightarrow^c$ is not a controlling computation of
$B_{mutex}$ in $\rightarrow$. □

Theorem 2 (Sufficient Condition) For a computation $\rightarrow$ of $S$, and a global 
predicate $B_{mutex}$, 

$\mapsto$ has no cycles $\Rightarrow$ a controlling computation of $B_{mutex}$ in $\rightarrow$ exists 



Proof: Since $\mapsto$ has no cycles, we can arrange all of the critical sections 
in a sequence: $CS_1, CS_2, \ldots, CS_m$ such that $CS_i \mapsto CS_j \Rightarrow i < j$. 
Let $\rightarrow^c$ be defined as 
$(\rightarrow \cup\ \{(CS_i.last, CS_{i+1}.first) \mid 1 \leq i \leq m-1\})^+$, where 
$()^+$ is the transitive closure. Clearly $\rightarrow\ \subseteq\ \rightarrow^c$. In 
the next paragraph, we will prove that $\rightarrow^c$ is a partial order. Assume 
that there is a global state $G$ such that $\neg B_{mutex}(G)$. Therefore, we can 
find states $s$ and $t$ in $G$ such that $critical(s)$ and $critical(t)$. Let 
$CS_i$ and $CS_j$ be the two critical sections to which $s$ and $t$ belong 
respectively; w.l.o.g., let $i < j$. Therefore, $s \rightarrow^c t$, and 
$\neg\,\rightarrow^c$-consistent$(G)$. Therefore, 
$\rightarrow^c$-consistent$(G) \Rightarrow B_{mutex}(G)$. So $\rightarrow^c$ is a 
controlling computation of $B_{mutex}$ in $\rightarrow$. 

Our remaining proof obligation is to prove that $\rightarrow^c$ is a partial order. 
To this end, let $\rightarrow^k$ be defined as: 
$(\rightarrow \cup\ \{(CS_i.last, CS_{i+1}.first) \mid 1 \leq i \leq k-1\})^+$. 

We make the following claim: 

Claim: $\forall\, 1 \leq k \leq m$ : (1) $\rightarrow^k$ is a partial order, and 
(2) $CS_i \mapsto^k CS_j \Rightarrow i < j$ 

Clearly $\rightarrow^m\ =\ \rightarrow^c$, and so this claim implies that 
$\rightarrow^c$ is a partial order. 

Proof of Claim: (by induction on $k$) 

Base Case: Immediate from $\rightarrow^1\ =\ \rightarrow$. 

Inductive Case: We make the inductive hypothesis that $\rightarrow^{k-1}$ is a 
partial order, and that $CS_i \mapsto^{k-1} CS_j \Rightarrow i < j$. We may rewrite 
the definition of $\rightarrow^k$ as: 
$(\rightarrow^{k-1} \cup\ \{(CS_{k-1}.last, CS_k.first)\})^+$. First we demonstrate 
that $\rightarrow^k$ is irreflexive and transitive (which together imply asymmetry). 

(i) Irreflexivity: Let $s \rightarrow^k t$. There are two possibilities: either 
$s \rightarrow^{k-1} t$ or 
$s \rightarrow^{k-1} CS_{k-1}.last \wedge CS_k.first \rightarrow^{k-1} t$. In the 
first case, the inductive hypothesis tells us that $\rightarrow^{k-1}$ is 
irreflexive and so $s \neq t$. In the second case, part (1) of the inductive 
hypothesis tells us that $\rightarrow^{k-1}$ is transitive, and part (2) of the 
inductive hypothesis tells us that $CS_k.first \not\rightarrow^{k-1} CS_{k-1}.last$, 
and so $s \neq t$. 

(ii) Transitivity: This is immediate from the definition of $\rightarrow^k$. 

Therefore, $\rightarrow^k$ is a partial order. We now show the second part of the 
claim. Suppose $CS_i \mapsto^k CS_j$. This implies that 
$CS_i.first \rightarrow^k CS_j.last \wedge i \neq j$. There are two cases: either 
$CS_i.first \rightarrow^{k-1} CS_j.last \wedge i \neq j$, or 
$CS_i.first \rightarrow^{k-1} CS_{k-1}.last \wedge CS_k.first \rightarrow^{k-1} CS_j.last \wedge i \neq j$. 
In the first case, we have $CS_i \mapsto^{k-1} CS_j$ and so, by the inductive 
hypothesis, $i < j$. In a similar manner, the second case would give us 
$i \leq k-1 \wedge k \leq j$ and so $i < j$. □ 

In conclusion, the necessary and sufficient condition for finding a controlling 
computation for $B_{mutex}$ is that there is no cycle of critical sections with 
respect to $\mapsto$. Further note that, since the proof of Theorem 2 is constructive, we can 
use it to design a naive algorithm to find a controlling computation. (We will 
see why this algorithm is naive in Section 3.5). 
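
The construction in the proof of Theorem 2 can be sketched in a few lines. This is a hedged illustration rather than the paper's algorithm: the dictionary encoding of critical sections, the state names, and the explicit happened-before set are assumptions, and the linearization is delegated to the standard-library `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter, CycleError

def naive_control(sections, hb):
    """sections maps a name to its (first, last) states.

    Returns the added (last, first) pairs between consecutive critical
    sections of a linearization, or None if the relation has a cycle.
    """
    # preds[c] holds every d whose first state happened before c's last state.
    preds = {c: {d for d in sections
                 if d != c and hb(sections[d][0], sections[c][1])}
             for c in sections}
    try:
        order = list(TopologicalSorter(preds).static_order())
    except CycleError:
        return None
    return [(sections[a][1], sections[b][0]) for a, b in zip(order, order[1:])]

sections = {"A": ("a1", "a2"), "B": ("b1", "b2")}

# Hypothetical trace: one message from a1 that is received at b2.
acyclic = {("a1", "a2"), ("b1", "b2"), ("a1", "b2")}
print(naive_control(sections, lambda s, t: (s, t) in acyclic))
# [('a2', 'b1')]

# Adding a crossing message from b1 to a2 creates a cycle.
cyclic = acyclic | {("b1", "a2")}
print(naive_control(sections, lambda s, t: (s, t) in cyclic))
# None
```

The sketch adds a pair between every two consecutive critical sections, exactly as in the proof; it makes no attempt to skip pairs that are already ordered.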



3.2 Off-Line Readers Writers Problem 



Let read_critical and write_critical be functions that map a state onto a boolean 
value. Further, no state can be both read_critical and write_critical (any read 
and write locked state is considered to be only write locked). Let 
$critical(s) \equiv read\_critical(s) \vee write\_critical(s)$. The Off-line Readers 
Writers Problem is a specialization of the Off-line Predicate Control Problem to 
the following class of global predicates: 

$B_{rw}(G) \equiv \forall\ \text{distinct}\ s, t \in G : \neg(write\_critical(s) \wedge critical(t))$



Given a read_critical function and a write_critical function on $S$ and a 
computation $\rightarrow$ on $S$, we define a read critical section and a write 
critical section in an analogous fashion to the critical sections that we defined 
before. Note that, since no state is both read_critical and write_critical, 
critical sections in a process do not overlap. 

Let $\mapsto$ be a relation on both read and write critical sections defined as: 
$CS \mapsto CS' \equiv CS.first \rightarrow CS'.last \wedge CS \neq CS'$ 



Theorem 3 (Necessary Condition) For a computation $\rightarrow$ of $S$, and a global 
predicate $B_{rw}$, 

a controlling computation of $B_{rw}$ in $\rightarrow$ exists $\Rightarrow$ all cycles in $\mapsto$ contain 
only read critical sections 



Proof: The proof is similar to the proof of Theorem 1. We will prove the 
contrapositive. Let $\mapsto$ have a cycle, say 
$CS_1 \mapsto CS_2 \mapsto \cdots \mapsto CS_m \mapsto CS_1$. Without loss of 
generality, let $CS_1$ be a write critical section. Let $\rightarrow^c$ be a 
computation such that $\rightarrow\ \subseteq\ \rightarrow^c$. 

First, we claim that there is at least one critical section in the cycle, say 
$CS_k$ (where $k \neq 1$), such that $CS_1.last \not\rightarrow^c CS_k.first$ and 
$CS_k.last \not\rightarrow^c CS_1.first$. To prove this, we assume the opposite: 

$\forall\, CS_k\ (k \neq 1) : CS_1.last \rightarrow^c CS_k.first\ \vee\ CS_k.last \rightarrow^c CS_1.first$   (i) 

and prove a contradiction as follows. $CS_m \mapsto CS_1$ implies 
$CS_1.last \not\rightarrow^c CS_m.first$. Therefore, by (i), 
$CS_m.last \rightarrow^c CS_1.first$. This allows us to define $j$ as the smallest 
integer such that $CS_j.last \rightarrow^c CS_1.first$. $CS_1 \mapsto CS_2$ implies 
that $CS_2.last \not\rightarrow^c CS_1.first$. Therefore, $j \neq 2$. In particular, 
$CS_{j-1}$ and $CS_1$ are distinct. By our choice of $j$, 
$CS_{j-1}.last \not\rightarrow^c CS_1.first$. So, using (i), 
$CS_1.last \rightarrow^c CS_{j-1}.first$. We now have a cycle: 
$CS_1.last \rightarrow^c CS_{j-1}.first$ (as above), 
$CS_{j-1}.first \rightarrow^c CS_j.last$ (since $CS_{j-1} \mapsto CS_j$), 
$CS_j.last \rightarrow^c CS_1.first$ (by our choice of $j$), and 
$CS_1.first \rightarrow^c CS_1.last$ (by the definition of first and last). This 
cycle contradicts the fact that $\rightarrow^c$ is a partial order. 

Since we have demonstrated the existence of a $CS_k$ such that 
$CS_1.last \not\rightarrow^c CS_k.first$ and $CS_k.last \not\rightarrow^c CS_1.first$, 
we can use a proof similar to the one in Theorem 1 to show that $\rightarrow^c$ is 
not a controlling computation of $B_{rw}$ in $\rightarrow$. □ 

Theorem 4 (Sufficient Condition) For a computation $\rightarrow$ of $S$, and a global 
predicate $B_{rw}$, 

all cycles in $\mapsto$ contain only read critical sections $\Rightarrow$ a controlling 
computation of $B_{rw}$ in $\rightarrow$ exists 



Proof: Consider the set of strongly connected components of the set of critical 
sections with respect to the $\mapsto$ relation. Define the $\hookrightarrow$ 
relation on strongly connected components as 
$SCC \hookrightarrow SCC' \equiv \exists\, CS \in SCC, CS' \in SCC' : CS \mapsto CS' \wedge SCC \neq SCC'$. 
It is verifiable that $\hookrightarrow$ is a partial order. Therefore, we can 
linearize it to get a sequence of all strongly connected components, say 
$SCC_1, SCC_2, \ldots, SCC_l$ such that $SCC_i \hookrightarrow SCC_j \Rightarrow i < j$. 
Let $\rightarrow^c$ be defined as 
$(\rightarrow \cup\ \{(CS_i.last, CS_j.first) \mid CS_i \in SCC_k, CS_j \in SCC_{k+1}$ for some $1 \leq k \leq l-1\})^+$. 
Clearly $\rightarrow\ \subseteq\ \rightarrow^c$. We can show that $\rightarrow^c$ is a 
partial order along similar lines as the proof of Theorem 2. 

We now show that $\rightarrow^c$ is a controlling computation of $B_{rw}$ in 
$\rightarrow$. Suppose $G$ is a global state such that $\neg B_{rw}(G)$. Therefore, 
we can find states $s$ and $t$ such that $write\_critical(s)$ and $critical(t)$. 
Let $CS$ be a write critical section that contains $s$ and let $CS'$ be a critical 
section that contains $t$. Let $SCC_i$ and $SCC_j$ be the strongly connected 
components that contain $CS$ and $CS'$ respectively. $SCC_i$ is distinct from 
$SCC_j$ since, otherwise, there would be a cycle in $\mapsto$ that contains a write 
critical section. Without loss of generality, let $i < j$. By the definition of 
$\rightarrow^c$, we have $s \rightarrow^c t$ and, therefore, 
$\neg\,\rightarrow^c$-consistent$(G)$. Therefore, $\rightarrow^c$ is a controlling 
computation of $B_{rw}$ in $\rightarrow$. □ 

Note, as before, that the proof of Theorem 4 can be used to design an algo- 
rithm to find a controlling computation. 
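
One way to turn the proof of Theorem 4 into code is sketched below. It is an illustration under stated assumptions: the relation on critical sections is passed in as an adjacency dictionary, each critical section is a hypothetical `(first, last, is_write)` triple, and strongly connected components are found with Tarjan's algorithm, a standard technique that the paper does not prescribe.

```python
def tarjan_scc(graph):
    """SCCs of a directed graph (node -> successors), sinks of the condensation first."""
    index, low, stack, on_stack, out = {}, {}, [], set(), []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of a component
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            out.append(comp)

    for v in graph:
        if v not in index:
            visit(v)
    return out

def rw_control(sections, graph):
    """Added (last, first) pairs, or None if some cycle contains a write section."""
    sccs = list(reversed(tarjan_scc(graph)))   # linearization, sources first
    for comp in sccs:
        if len(comp) > 1 and any(sections[c][2] for c in comp):
            return None
    return [(sections[a][1], sections[b][0])
            for prev, curr in zip(sccs, sccs[1:])
            for a in prev for b in curr]

# Hypothetical example: two read sections in a cycle, followed by a write section.
sections = {"R1": ("r1a", "r1b", False),
            "R2": ("r2a", "r2b", False),
            "W":  ("w1", "w2", True)}
graph = {"W": [], "R1": ["R2", "W"], "R2": ["R1"]}
print(rw_control(sections, graph))
```

The two read sections end up in one component and are left unordered with respect to each other, while both are ordered before the write section.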



3.3 Off-Line Independent Mutual Exclusion 

Let $critical_1, critical_2, \ldots, critical_m$ be functions that map an event onto 
a boolean value. The Off-line Independent Mutual Exclusion Problem is a 
specialization of the Off-line Predicate Control Problem to the following class of 
global predicates: 

$B_{ind}(G) \equiv \forall\ \text{distinct}\ s, t \in G : \forall i : \neg(critical_i(s) \wedge critical_i(t))$

Given a function $critical_i$ on $S$ and a computation $\rightarrow$ on $S$, we 
define an i-critical section in an analogous fashion to the critical sections that 
we defined before. Note that the definition allows independent critical sections on 
the same process to overlap. In particular, the same set of states may correspond 
to two different critical sections (corresponding to a critical section with 
multiple locks). Let $\mapsto$ be a relation on all critical sections defined as 
before. 

Theorem 5 (Necessary Condition) For a computation $\rightarrow$ of $S$, and a global 
predicate $B_{ind}$, 

a controlling computation of $B_{ind}$ in $\rightarrow$ exists $\Rightarrow$ $\mapsto$ has no cycles 
of i-critical sections, for some i 

Proof: The proof is almost identical to the proof of Theorem 1. □ 

Theorem 6 (Sufficient Condition) For a computation $\rightarrow$ of $S$, and a global 
predicate $B_{ind}$, 

$\mapsto$ has no cycles of i-critical sections, for some i $\Rightarrow$ a controlling 
computation of $B_{ind}$ in $\rightarrow$ exists 

Proof: The proof is along similar lines to the proof of Theorem 4. In this case 
we take strongly connected components as before, but make use of the fact that 
no two i-critical sections may be in the same strongly connected component 
(otherwise, there would be a cycle of i-critical sections). □ 



3.4 Off-Line Independent Read-Write Mutual Exclusion 

Using similar definitions, the Off-line Independent Read-Write Mutual Exclusion 
Problem is a specialization of the Off-line Predicate Control Problem to the 
following class of global predicates: 

$B_{ind\text{-}rw}(G) \equiv \forall\ \text{distinct}\ s, t \in G : \forall i : \neg(write\_critical_i(s) \wedge critical_i(t))$

As before, we define i-read critical sections and i-write critical sections 
$(1 \leq i \leq m)$. Similarly, let $\mapsto$ be a relation on all critical sections. 
The necessary and sufficient condition is a combination of those of the previous 
two sections. Since the proofs are similar to the previous ones, we simply state: 

Theorem 7 (Necessary and Sufficient Condition) 

For a computation $\rightarrow$ of $S$, and a global predicate $B_{ind\text{-}rw}$, 

a controlling computation of $B_{ind\text{-}rw}$ in $\rightarrow$ exists $\equiv$ all cycles of 
i-critical sections in $\mapsto$ contain only read critical sections 
Types:  state: (pid: int; v: vector_clock); 
        critical_section: (pid: int; first: state; last: state; 
                           cs_id: integer; write_critical: boolean); 
        strongly_conn_component: set of critical_section; 

Input:  $C_1, C_2, \ldots, C_n$: list of critical_section 
Output: O: list of (state, state), initially null 
Vars:   sccset, crossable: set of strongly_conn_component 
        crossed, prev, curr: strongly_conn_component 
        cs, cs': critical_section 
        ordered: list of strongly_conn_component 

while ($\exists i : C_i \neq$ null) do 
    sccset := get_scc($C_1$.head, $C_2$.head, $\ldots$, $C_n$.head) 
    crossable := { $s \in$ sccset | $\forall s' \in$ sccset, $s' \neq s$ : $s' \not\hookrightarrow s$ } 
    crossed := select(crossable); 
    if (not_valid(crossed)) then 
        exit("No Controlled Computation Exists"); 
    for each cs in crossed do 
        $C_{cs.pid}$.delete_head(); 
    ordered.add_tail(crossed); 
prev := ordered.delete_head(); 
while (ordered $\neq$ null) do 
    curr := ordered.delete_head(); 
    for each cs in prev and cs' in curr do 
        if (cs.last $\not\rightarrow$ cs'.first) then 
            O.add_head(cs.last, cs'.first); 
    prev := curr; 

Fig. 3. Algorithm for Off-line Independent Read-Write Mutual Exclusion 



3.5 Algorithm 

Figure 3 shows the algorithm to find a controlling computation of 
$B_{ind\text{-}rw}$ in $\rightarrow$. Since the other forms of mutual exclusion are 
special cases of $B_{ind\text{-}rw}$, this algorithm can be applied to any of them. 

The input to the algorithm is the computation, represented by n lists of 
critical sections $C_1, \ldots, C_n$. For now, to simplify presentation, we assume 
that critical sections are totally ordered on each process. Each critical section 
is represented as its process id, its first and last states, a type identifier 
cs_id that specifies the $critical_{cs\_id}$ function, and a flag indicating if it 
is a write or read critical section. The partial order $\rightarrow$ is implicitly 
maintained by vector clocks [9] associated with the first and last states of each 
critical section. The algorithm outputs the $\rightarrow^c$ relation specified as a 
list of ordered pairs of states. 

The first while loop of the algorithm builds ordered, a totally ordered set of 
strongly connected components of critical sections (called scc's from here on). 
The second while loop simply uses ordered to construct the $\rightarrow^c$ relation. 

The goal of each iteration of the first while loop is to add an scc, which is 
minimal w.r.t. $\hookrightarrow$, to ordered (where $\hookrightarrow$ is the 
relation on scc's defined in the proof of Theorem 4). To determine this scc, it 
first computes the set of scc's among the leading critical sections in 
$C_1, \ldots, C_n$. Since no scc can contain two critical sections from the same 
process, it is sufficient to consider only the leading critical sections. From the 
set of scc's, it determines the set of minimal scc's, crossable. It then randomly 
selects one of the minimal scc's. Finally, before adding the scc to ordered, it 
must check if the scc is not_valid, where 
not_valid(crossed) $\equiv \exists\, cs, cs' \in$ crossed, $cs \neq cs'$ : 
cs.cs_id = cs'.cs_id $\wedge$ cs.write_critical. If an invalid scc is found, no 
controlling computation exists (by Theorem 7). 
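
The validity check can be written down directly. The sketch below assumes a hypothetical `CriticalSection` record carrying only the fields the check needs, and it reads the two quantified critical sections as distinct members of the scc.

```python
from collections import namedtuple
from itertools import permutations

# Hypothetical record; only the fields used by the check are included.
CriticalSection = namedtuple("CriticalSection", "pid cs_id write_critical")

def not_valid(crossed):
    """True if two distinct sections of the scc share a cs_id and one is a write."""
    return any(a.cs_id == b.cs_id and a.write_critical
               for a, b in permutations(crossed, 2))

readers = [CriticalSection(0, 1, False), CriticalSection(1, 1, False)]
mixed = [CriticalSection(0, 1, True), CriticalSection(1, 1, False)]
independent = [CriticalSection(0, 1, True), CriticalSection(1, 2, True)]

print(not_valid(readers))      # False: only readers share lock 1
print(not_valid(mixed))        # True: a writer shares lock 1 with a reader
print(not_valid(independent))  # False: the writers hold different locks
```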

The main while loop of the algorithm executes p times in the worst case, 
where p is the number of critical sections in the computation. Each iteration 
takes $O(n^2)$, since it must compute the scc's. Thus, a simple implementation 
of the algorithm will have a time complexity of $O(n^2 p)$. However, a better 
implementation of the algorithm would amortize the cost of computing scc's over 
multiple iterations of the loop. Each iteration would compare each of the critical 
sections that have newly reached the heads of the lists with the existing scc's, 
thus forming new scc's. Therefore, each of the p critical sections reaches the head 
of the list just once, when it is compared with n - 1 critical sections to 
determine the new scc's. The time complexity of the algorithm with this improved 
implementation is, therefore, $O(np)$. Note that a naive algorithm based directly 
on the constructive proof of the sufficient condition in Theorem 7 would take 
$O(p^2)$. We have reduced the complexity significantly by using the fact that the 
critical sections in a process are totally ordered. 
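
To get a feel for the three bounds, consider a purely hypothetical trace with n = 10 processes and p = 1000 critical sections (constant factors ignored):

```latex
\begin{align*}
n^2 p &= 10^2 \cdot 10^3 = 10^5 && \text{(simple implementation)} \\
n p   &= 10 \cdot 10^3 = 10^4   && \text{(amortized implementation)} \\
p^2   &= (10^3)^2 = 10^6        && \text{(naive algorithm from the constructive proof)}
\end{align*}
```

The gap between the last two widens as the trace grows, since p typically increases with the length of the execution while n stays fixed.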

The algorithm has implicitly assumed a total ordering of critical sections 
in each process. However, as noted before, independent critical sections on the 
same process may overlap, and may even coincide exactly (a critical section with 
multiple locks is treated as multiple critical sections that completely overlap). 
The algorithm can be extended to handle such cases by first determining the 
scc's within a process. These scc's correspond to maximal sets of overlapping 
critical sections. The input to the algorithm would consist of n lists of such 
process-local scc's. The remainder of the algorithm remains unchanged. 

4 Application to Software Fault Tolerance 

Our proposed scheme for software fault tolerance consists of four parts: (i) tracing 
an execution, (ii) detecting a synchronization failure, (iii) determining a control 
strategy, and (iv) re-executing under control. This paper has focused mainly on 
the problem of determining a control strategy. We have designed an efficient 
algorithm that determines which synchronizations to add in order to avoid very 
general forms of mutual exclusion violation. As mentioned before, the other three 
parts of our scheme have been addressed as independent problems. We now put 
all the pieces together for a comprehensive look at how race failures (mutual 
exclusion violations) can be tolerated. 

The problem of determining a control strategy was placed in a very general 
model of concurrent execution. However, tracing, detection, and controlled re- 
execution depend greatly on the particular concurrent paradigm. We choose a 
simple example that demonstrates the key issues that will arise in most 
concurrent paradigms. Consider a distributed system of processes that write to a 
single shared file. The file system itself does not synchronize accesses and so 
the processes are responsible for synchronizing their accesses to the file. If 
they do not do so, the writes may interleave and the data may get corrupted. Since 
the file data is very crucial, we must ensure that races can be tolerated. 
Synchronization occurs through the use of explicit message passing between the 
processes. 

[Figure: (a) Traced Computation, (b) Critical Section Graph, (c) Controlling Computation] 

Fig. 4. Example: Tolerating Races in a Concurrent Execution 

The first part of our mechanism involves tracing the execution. The con- 
cern during tracing is to reduce the space and time overhead, so that tolerat- 
ing a possible fault does not come at too great a cost. Much work has been 
done in implementing tracing in various paradigms, while keeping the overhead 
low [2,11,12,15]. In our example, we use a vector clock mechanism [9], updat- 
ing the vector clock at each send and receive point. This vector clock needs to 
be logged for each of the writes to the file (for our algorithm). The vector clock 
values must also be logged for each receive point (for replay). When a write is ini- 
tiated, and when it returns, the vector clock must be logged. In our example, the 
writes are typically very long and therefore are performed asynchronously. Thus, 
execution continues while the write is in progress. In particular, the process may 
receive a message from another process during its write to the file. Inserting some 
computation at the send, receive, write initiation, and write completion points 
can be achieved either by code instrumentation, or by modifying the run-time 
environment (message-passing interface and the file system interface). 
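
The vector clock bookkeeping described above can be sketched as follows. This is a minimal illustration of the standard vector clock rules, not the instrumentation used by the authors; the class and method names are hypothetical, and a real tracer would write the returned snapshots to a log at the send, receive, write initiation, and write completion points.

```python
class VectorClock:
    """Vector clock for one of n processes (standard update rules)."""

    def __init__(self, n, pid):
        self.pid = pid
        self.v = [0] * n

    def tick(self):
        """Local event, e.g. write initiation or completion; returns a snapshot to log."""
        self.v[self.pid] += 1
        return tuple(self.v)

    def send(self):
        """Timestamp to piggyback on an outgoing message."""
        return self.tick()

    def receive(self, stamp):
        """Merge the sender's timestamp, then advance the local component."""
        self.v = [max(a, b) for a, b in zip(self.v, stamp)]
        return self.tick()

def happened_before(u, v):
    """u happened before v iff u is componentwise <= v and u differs from v."""
    return all(a <= b for a, b in zip(u, v)) and u != v

p0, p1 = VectorClock(2, 0), VectorClock(2, 1)
write_start_p0 = p0.tick()          # (1, 0)
early_p1 = p1.tick()                # (0, 1)
message = p0.send()                 # (2, 0)
p1.receive(message)                 # (2, 2)
write_start_p1 = p1.tick()          # (2, 3)

print(happened_before(write_start_p0, write_start_p1))  # True
print(happened_before(write_start_p0, early_p1))        # False (concurrent)
```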

The second part of our mechanism is detecting when a race occurs. Many 
existing tools have been built to solve exactly this problem [4,10]. Since we 
use message passing as our synchronization mechanism, the methods described 
in [10] are particularly applicable. 

Once a race has been detected, we roll back all processes to a consistent 
global state prior to the race. We also roll back the file to a version consistent 
with the rolled-back state of the processes. (We assume a versioned file system 
with the ability to roll back.) We then take the section of the traced vector 
clock values that occur after the rolled-back state. These indicate the critical 
section entry and exit points required by our algorithm. The algorithm would 
take $O(np)$ time, where n is the number of processes and p is the number of 
critical sections that have been rolled back. The output of the algorithm is 
the set of added synchronizations specified as pairs of critical section boundary 
points. Figure 4 demonstrates a possible scenario. Here the semantics of mutual 
exclusion correspond to a single exclusive lock. Therefore, the necessary and 
sufficient condition is that there are no cycles in the critical section graph shown 
in Figure 4(b). Applying the algorithm would add synchronizations to give the 
controlling computation shown in Figure 4(c). 

The next step is to replay the processes using the logged vector clock values 
of the receive points. Each receive point must be blocked until the same mes- 
sage arrives as in the previous execution. This is a standard replay mechanism 
(e.g. [12]). In addition to this replay, we must impose additional synchroniza- 
tions. For example, suppose (s, t) is one of the synchronizations output by our 
algorithm. The state s is a critical section exit point while t is a critical section 
entry point. Each of these additional synchronizations is implemented by a con- 
trol message sent from s and received before t. Thus, at each critical section exit 
point, we must check the added synchronizations to decide if a control message 
must be sent. At each critical section entry point, we must check the added syn- 
chronizations to decide if the process must block waiting for a control message. 
As in tracing, the points at which computation must be added are the write 
initiation and completion points, and the send and receive points. Again, we can 
accomplish this by code instrumentation or run-time environment modification. 
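
The checks performed at critical section boundaries during replay can be sketched as below. The sketch makes several assumptions that are not in the paper: processes are modelled as threads, a control message is modelled by a `threading.Event`, and the class and method names are hypothetical. In the distributed setting described above, the event would be replaced by an actual message from the process of s to the process of t.

```python
import threading

class ControlledReplay:
    """Enforces added synchronizations (s, t): exit point s before entry point t."""

    def __init__(self, added_syncs):
        # One event per added synchronization, standing in for a control message.
        self.events = {pair: threading.Event() for pair in added_syncs}

    def on_exit(self, state):
        """At a critical section exit: 'send' every control message that starts here."""
        for (s, _t), event in self.events.items():
            if s == state:
                event.set()

    def on_entry(self, state, timeout=None):
        """At a critical section entry: block until all expected messages arrived."""
        return all(event.wait(timeout)
                   for (_s, t), event in self.events.items() if t == state)

log = []
replay = ControlledReplay([("a2", "b1")])

def process_b():
    replay.on_entry("b1")      # blocks until process A has left its section
    log.append("B enters")

def process_a():
    log.append("A exits")
    replay.on_exit("a2")

tb = threading.Thread(target=process_b)
ta = threading.Thread(target=process_a)
tb.start()
ta.start()
tb.join()
ta.join()
print(log)  # ['A exits', 'B enters']
```

Entry points with no added synchronization pass through `on_entry` immediately, so uncontrolled parts of the execution are not delayed.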

We have chosen an example in which the processes only write to the file. If 
the processes were to read from the file as well, then that would cause causal 
dependencies between processes. Then we would have to track these causal de- 
pendencies as we did for messages. Another option would be to assume that 
these causal dependencies do not affect the message communications, in which 
case, we do not have to track them. However, if we take this approach, we would 
have to check to see that our traced computation is the same as the one be- 
ing replayed. In case of a divergence, we would leave the execution to proceed 
uncontrolled from the point of divergence. 



5 Concluding Remarks 

We have presented an approach for tolerating synchronization faults in concur- 
rent programs based on rollback and controlled re-execution. Our focus in this 
paper has been on races, which form a particular type of synchronization fault. 
In order to determine a control strategy that avoids races while re-executing, we 
have solved the off-line predicate control problem for various forms of mutual 
exclusion properties. We have determined the necessary and sufficient conditions 
for solving off-line predicate control for simple mutual exclusion, read-write 
mutual exclusion, independent mutual exclusion, and independent read-write 
mutual exclusion. We have presented an efficient algorithm that solves for the 
most general property, independent read-write mutual exclusion. The algorithm 
takes $O(np)$ time, where n is the number of processes and p is the number of 
critical sections. Finally, we have demonstrated how races can be tolerated using 
our algorithm. An implementation of software fault tolerance using controlled 
re-execution is currently being developed in order to evaluate the performance 
and effectiveness of the technique in practice. 

It may be argued that mutual exclusion could be simply handled at the pro- 
gramming language level using locks (in other words, on-line mutual exclusion, 
as opposed to off-line mutual exclusion). However, there are good reasons for our 
approach. Firstly, as noted in Section 1, it is impossible to ensure that there will 
be no deadlocks with on-line locking unless some assumptions are made, such 
as non-blocking critical sections. In off-line mutual exclusion, no such assump- 
tions are required. Secondly, programmers make mistakes, being prone to reduce 
locking for greater efficiency. Thirdly, source code is often unavailable for mod- 
ification, while requirements change dynamically. In modern component-based 
systems, different components may come from different vendors and it may be 
difficult to ensure a consistent locking discipline throughout the code. The best 
approach is to use both good programming discipline and a software fault 
tolerance technique to make programs more resistant to failures. 
Abstract. This paper introduces DUALITY, a design model that provides 
a more structured style of parallel programming and refines causal- 
ity from concurrency. We investigate semantic and syntactic transforma- 
tions that support identifying the structure of a parallel program, as the 
basis for reducing the design complexity. The initial focus is on specifi- 
cation and correctness, then gradually adding architectural details and 
finally addressing efficiency. A parallel program is viewed as a Meta- 
Program - the result of causally composing an architecture-independent 
algorithm - the specification, with an architecture-dependent program 
- the mapping. This approach supports the derivation of efficient par- 
allel implementations from program specifications. Consequently, trans- 
parent and architecture-independent specifications can be transformed 
into forms that match particular target architectures. Correctness of 
the implementation is inferred from correctness of the specification, by 
gradually imposing temporal and causal order and by transforming any 
property of the specification into a property of the parallel program. 
DUALITY relates data and process parallelism and aims to reuse de- 
sign knowledge from sequential patterns. DUALITY is developed in the 
context of the UNITY formalism and the principle and algebraic laws of 
Communication Closed Layers (CCL), and illustrated through the algo- 
rithm of all-points shortest path. 



1 Introduction 

In recent years, the effectiveness of a more abstract model for the design of 
parallel programs has been stressed. Such a model separates the structure of the 
parallel program from the organization of the parallel architecture. 

Chandy and Misra’s UNITY formalism [1] provides a simple implementation- 
independent notation and a proof system based on temporal logic. A UNITY 
program is decoupled from its implementation. It derives much of its simplicity 
by abstracting away from the notion of control flow. The programs are bags 
of multiple assignments executed nondeterministically, under the weak fairness 
assumption that each assignment is selected infinitely often. The formalism is 
suitable for specifying and verifying various safety and progress properties of con- 
current programs whose semantics is modeled by weakly fair transition systems. 
A strong aspect determined by the simplicity of the notation is the capacity to de- 
velop parallel or distributed applications jointly with correctness proofs. UNITY 
enforces the idea of representing parallel or distributed systems as collections of 
guarded actions, without any apparent algebraic structure. The motivation is 
that the algebraic syntactic structure of the programming notation is too close 
to the actual architecture of the target machines and that this aspect should 
not influence the (initial) design of the systems. The work in [1] stresses the 
design of parallel or distributed systems starting with an initial phase, which is 
architecture and implementation independent. Hence the difficulty and the 
challenge of balancing UNITY's absence of control structures with the goal of 
devising efficient UNITY implementations on existing parallel or distributed 
architectures. This aspect can become a drawback for complex systems like mobile 
distributed systems or real-time embedded systems. 

Bouge [2] and Snyder [11] argue as well for a more abstract model for the 
design of parallel and distributed programs, based on separating the structure 
of the program from the organization of the architecture. Their work shows that 
the data-parallel programming model can make a clear separation between the 
programming model and the execution model. 

Another school of thought evolved from the algebraic approach, often referred 
to as the modularity or compositionality principle [7], [9]. Although there is 
agreement with the idea that initial design should ignore architectural details, 
the compositionality principle states that a (parallel or distributed) system can 
be derived from specifications of its components, by ignoring their internal 
algebraic structure. However, in contrast with [1], this trend does not consider 
the algebraic syntactic structure of the programming notation to be too close to 
the target architecture. Moreover, it is claimed that dependency on one 
architecture or another is due to certain language operations, such as the 
sequential operation. A model for concurrency is proposed. It distinguishes 
between parallel program composition F [] G [9], classical sequential composition 
F ; G, and a weak sequential composition, called layer composition, F • G 
(inspired by the Communication Closed Layers principle of Elrad and Francez [3]). 
For the layer composition F • G, any event f from F precedes, in causal 
dependence, any event g from G with which it is in conflict. By comparison, 
sequential composition, denoted with ;, is a representation of the temporal 
dependence. Work in [7, 9] formulates the algebraic laws of the layer composition, 
provided that there is no conflict between the programs F and K and between H and 
G respectively: 

(F [] H) • (G [] K) = (F • G) [] (H • K). 

This law is not valid if causal order, expressed by the layer composition, is 
replaced with temporal order, represented by the sequential composition. The 
Communication Closed Layers (CCL) principle enables a direct representation 
of properties involving temporal and causal order. Identifying causality of the 
events helps in bounding the nondeterminism. Serializability is a property that 
encompasses causal order and concurrency. It is very useful for program 
refinement, preserving mapping in the syntactic domain, for the purpose of efficient 
implementations on parallel architectures. 

Our paper presents DUALITY [14], an extension of the UNITY concurrency 
model with compositional programming abstractions and partial order based 
structures. One reason to build on UNITY is given by the availability of an 
architecture-independent programming notation and logic, within the same 
formalism. We believe that our design model can also be supported by partial-order 
temporal logic specifications like ISTL [8] or, to some extent, by branching 
time logic or CTL* [5]. DUALITY strongly adheres to the idea of ignoring 
implementation or architectural details during the initial design phases. Yet, our 
goals are to balance transparency and adaptation for applications running on 
various classes of parallel or distributed machines, including mobile or hybrid 
architectures. 

The DUALITY framework integrates both UNITY and the compositional, algebraic 
principle and intends to add more expressiveness and modularity to parallel 
programs. UNITY programs can be represented under the interleaved model of 
concurrency, i.e. one distinguishes between execution sequences that are not 
significantly different. Such execution sequences are equivalent because they 
differ either by the ordering of independent events or by the number of times a 
state is adjacently repeated. If an ordering is given in the specification, then it 
is assumed that the mapping also preserves the ordering. This could unnecessar- 
ily lead to an increased complexity of the design and verification. Partial-order 
semantics has been developed for modeling concurrency in a more abstract and 
faithful manner. The ordering is enforced in specification and mapping, only for 
the cases of dependent events, since independent events can be executed in any 
order. Under partial-order semantics, UNITY specifications no longer make dis- 
tinction between equivalent execution sequences, thus reducing the size of the 
explored state-space. Instead of representing all possible interleavings, all pos- 
sible execution sequences, it is sufficient to represent at least one sequence per 
equivalence class. UNITY properties by definition apply globally on programs. 
The idea helps in introducing new properties based on partial-order semantics 
in UNITY formalism. 

The outline of this paper is as follows. Section 2 gives a brief review of the 
background concepts and the formal semantics for the partial-order structures, 
defined in terms of UNITY. Section 3 presents an informal description of the
DUALITY design model. Section 4 illustrates a case of program transformation
and design reuse. Finally, Section 5 concludes and discusses future research
directions.

2 Extending UNITY with Partial Order Based Structures 

UNITY formalism [1] provides a language to construct well-founded formulas 
and a logic to construct proofs. This section gives a brief overview of UNITY. 
It also introduces the constructs that support a structured style of parallel
programming, and the method to integrate Communication Closed Layers as a
compositional abstraction in UNITY.

UNITY notation allows an abstract representation of the parallel computa- 
tion. A program has a section for variable declarations, a section specifying the 
initial values for the variables (the INITIALLY predicate) and a finite set of
assignments. A program is similar to a "bag" of assignments. Each assignment
can be selected nondeterministically, is eventually executed, and is returned
to the "bag". The execution model for UNITY programs is based on sequences of
tuples (a tuple being a state). Each element of a sequence models an execution
step of the program. A sequence denotes a possible run. A program can
terminate if it reaches a fixed point, where no state transition is possible.
In a UNITY program there is no notion of control structures or control flow
(these are considered implementation aspects). A program can be mapped to,
and then implemented on, different types
of architectures. The correctness of the UNITY program is independent of the 
target architecture and the manner in which the program is executed. However, 
efficiency must be evaluated with reference to the target architecture. 
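The execution model described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the helper run_unity,
the random scheduler and the two-assignment gcd program are hypothetical and
do not come from the paper. Random selection stands in for the
nondeterministic (fair) choice of an assignment from the "bag".

```python
import random


def run_unity(state, assignments, max_steps=10_000, seed=0):
    """Pick assignments at random until a fixed point is reached.

    Each assignment is a function that takes a copy of the state (a dict)
    and returns the possibly updated copy. A false guard acts as a skip.
    """
    rng = random.Random(seed)
    for _ in range(max_steps):
        # Fixed point: executing any assignment leaves the state unchanged.
        if all(a(dict(state)) == state for a in assignments):
            return state
        state = rng.choice(assignments)(dict(state))
    raise RuntimeError("no fixed point reached within max_steps")


# Hypothetical program: x := x - y if x > y  []  y := y - x if y > x
def sub_x(s):
    if s["x"] > s["y"]:
        s["x"] -= s["y"]
    return s


def sub_y(s):
    if s["y"] > s["x"]:
        s["y"] -= s["x"]
    return s


final = run_unity({"x": 12, "y": 18}, [sub_x, sub_y])
print(final)
```

The order in which the two assignments are chosen does not affect the final
state, which is the sense in which correctness is independent of the
execution schedule.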



Notations: s, t - UNITY assignments; s.r - the set of variables read by s; s.w
- the set of variables written by s; L, F - UNITY programs; L.a - the set of
all assignments belonging to L; L.r - the set of all variables read by L; L.w
- the set of all variables written by L; p - a precondition that must be true
before the execution of s; and q - a postcondition that holds after the
execution of s. Proofs in UNITY are based on assertions of type {p} s {q}. A
property P(p,q) applies globally to a program F, i.e. {p} F {q}, if and only
if <∀s : s ∈ F :: {p} s {q}>. For brevity, this paper focuses on three of the
UNITY logic properties: invariant and co (safety properties) and leads-to (a
progress property).

"invariant p = (INITIALLY ⇒ p) ∧ ({p} F {p})". To illustrate the property,
"invariant x>0" states that the variable x is positive in every state of the
program execution.

"p co q = <∀s : s ∈ F :: {p} s {q}>"; it signifies that if a state satisfying
p occurs anywhere in the execution, the next state must satisfy q.

"p leads-to q" signifies that once p becomes true, q will eventually become
true. However, it cannot be asserted that p will remain true as long as q is
not. To illustrate, "x=0 leads-to x>0" states that whenever x is equal to 0
during a program execution, it will become positive at some later state
in the execution.
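For a finite state space, the two safety properties can be checked by plain
enumeration. The sketch below is a hedged illustration, not the paper's proof
method: the function names, the one-variable program and the state range are
assumptions made for the example. The leads-to property is not covered,
because it additionally depends on a fairness assumption about the scheduler.

```python
def holds_co(p, q, assignments, states):
    """p co q: every assignment takes each state satisfying p to one satisfying q."""
    return all(q(a(dict(s))) for a in assignments for s in states if p(s))


def holds_invariant(p, initial_states, assignments, states):
    """invariant p: p holds initially and is preserved by every assignment."""
    return all(p(s) for s in initial_states) and holds_co(p, p, assignments, states)


# Hypothetical one-variable program: x := x + 1 if x < 5
def inc(s):
    if s["x"] < 5:
        s["x"] += 1
    return s


STATES = [{"x": v} for v in range(-3, 6)]   # assumed finite state space
INITIAL = [{"x": 0}]                        # assumed INITIALLY predicate: x = 0

inv_ok = holds_invariant(lambda s: s["x"] >= 0, INITIAL, [inc], STATES)
co_ok = holds_co(lambda s: s["x"] == 2, lambda s: s["x"] == 3, [inc], STATES)
co_bad = holds_co(lambda s: s["x"] == 2, lambda s: s["x"] == 2, [inc], STATES)
```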

UNITY defines union as a nondeterministic parallel composition. The union
of the programs S1 and S2 is denoted S1 [] S2 and has the following properties:

(i) INITIALLY(S1 [] S2) = INITIALLY(S1) ∧ INITIALLY(S2)

(ii) (S1 [] S2).a = S1.a ∪ S2.a.

The main idea is that the safety of the composite program follows from the
safety of its components. A shared-memory or a distributed program can be
represented as a union: F = S1 [] S2 [] .. [] Sn. Nevertheless, the absence of
control structures in UNITY limits the expressiveness of the union when
composing specifications and mappings for various parallel or distributed
architectures. This leads us to the conclusion that the UNITY notation for the
mappings should provide certain forms of control structures. A more structured
style of parallel programming can be provided by isolating weak sequential
phases and their interrelationship. The Communication Closed Layers principle
[3] and algebraic laws [7] formalize the concept of weak sequential
composition. The idea is that actions which are not dependent (unilaterally or
bilaterally data dependent) do not have to wait for one another to proceed,
even if they are composed sequentially. In the DUALITY framework, we relate
the UNITY model to both the interleaving and the partial-order semantics. The
interleaved execution of a UNITY program can
be interpreted as a succession of partial order executions. We extend UNITY 
with programming abstractions that involve temporal and causal order and help 
in identifying the interface between program structures. 



Property 1 Strong Enable s. If s labels an assignment, we define s.event as a
predicate that is true whenever s is selected and its guard is true
(eventually this leads to a program state transition): strong enable s =
s.event.



Property 2 Weak Enable s. If a selected statement s has a false guard, s.event
is false: weak enable s = ¬(s.event). (This is equivalent to the execution of
a skip statement.)



Definition 1 Progress Condition. L.event denotes a predicate that is true
if there exists an assignment s ∈ L.a such that s.event is true (i.e., strong
enable s holds). All the states where L.event holds form a consistent set of
local states associated with the program L: L.event = <∃s : s ∈ L.a ∧ s.event>.



Definition 2 Activation Condition. L.init denotes a predicate associated 
with the initial set of states, which precedes the selection and execution of any 
assignment belonging to L.a. 



Definition 3 Fixed Point. A Fixed Point, FP_L, for the program L is a state
predicate such that the execution of any statement s ∈ L.a in that state leaves
the state unchanged: FP_L = <∀s : s ∈ L.a ∧ p :: {p} s {p}>.



Definition 4 Group-of-Actions. A Group-of-Actions is a program structure
L identified by its activation condition, progress condition and Fixed Point,
such that the following properties hold:

(i) L.init co (L.event ∨ L.init)

(ii) L.event leads-to FP_L

Property (i) states that the execution of the Group-of-Actions L is always
enabled from its activation condition. Property (ii) states that the progress
condition for the program L causally precedes the eventual Fixed Point.
Example: Program F uses the Sieve Algorithm to compute the first count = K
prime numbers from a sequence of natural numbers seq.i (1 ≤ i ≤ N, N > K). F
(Fig. 1) is a union composition F = <[] j : 1 ≤ j ≤ N :: Lj>. The interfaces
of Lj are specified by the activation condition,

Lj.init = count = j-1 ∧ j < K ∧ seq.i = j < N; by FP_Lj = j < seq.i < N; and
by the progress condition,
Lj.event = seq.i > j ∧ count < K.



Program Lj initially count = j-1 ∧ j < K ∧ seq.i = j < N
assign

count, i := count + 1, i + 1 if seq.i = j []
i := i + 1 if (seq.i > j) ∧ (seq.i mod j = 0) []
seq.j+1, i := seq.i, i + 1 if (seq.i > j) ∧ (seq.i mod j ≠ 0)
end

Fig. 1. Program F - Sieve Algorithm 



We introduce new compositional abstractions that restrict nondeterminism 
and enable a direct representation of properties that involve temporal and causal 
order. (Notation: Li and Lj denote groups of actions; s and t denote UNITY
assignments.)

Definition 5 Mutual-Dependence. "s mutual-dependent t" iff
(s data-bilateral-dependent t) ∧ (s temporal-bilateral-dependent t).
Example: s: x:=1 if y=0 [] t: y:=0 if x=1; s.w = t.r = {x}, s.r = t.w = {y};
data bilateral dependence is expressed as (s.w ∩ t.r ≠ ∅) ∧ (s.r ∩ t.w ≠ ∅),
and temporal bilateral dependence as (s.event before t.event) ∨ (t.event
before s.event).

"Li mutual-dependent Lj" iff ∃s ∈ Li.a ∧ ∃t ∈ Lj.a such that the property
"s mutual-dependent t" holds.

Definition 6 Safe-Dependence. "s safe-dependent t" iff

(s temporal-unilateral-dependent t) ∨ (s data-unilateral-dependent t).
Example: s: y:=x-1 [] t: x:=x+1, where s.r = {x}, s.w = {y}, t.r = t.w = {x},
or s.event always precedes t.event.

"Li safe-dependent Lj" iff ∀s ∈ Li.a ∧ ∀t ∈ Lj.a, the property
"s safe-dependent t" holds.
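The data part of Definitions 5 and 6 can be evaluated directly from the read
and write sets. The sketch below is a minimal illustration: the Assignment
class is hypothetical, and the reading of "data-unilateral" as "exactly one of
the two intersections is non-empty" is an assumption chosen to match the
example of Definition 6, since the text does not define it formally. Temporal
dependence is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assignment:
    """A UNITY assignment reduced to its read set s.r and write set s.w."""
    name: str
    reads: frozenset
    writes: frozenset


def data_bilateral(s, t):
    # (s.w ∩ t.r ≠ ∅) ∧ (s.r ∩ t.w ≠ ∅), as in Definition 5
    return bool(s.writes & t.reads) and bool(s.reads & t.writes)


def data_unilateral(s, t):
    # Assumption: dependence in exactly one direction
    return bool(s.writes & t.reads) != bool(s.reads & t.writes)


# Example of Definition 5: s: x:=1 if y=0 [] t: y:=0 if x=1
s5 = Assignment("s", frozenset({"y"}), frozenset({"x"}))
t5 = Assignment("t", frozenset({"x"}), frozenset({"y"}))

# Example of Definition 6: s: y:=x-1 [] t: x:=x+1
s6 = Assignment("s", frozenset({"x"}), frozenset({"y"}))
t6 = Assignment("t", frozenset({"x"}), frozenset({"x"}))
```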

Definition 7 Communication Closed Layer (CCL Structure). A group
of actions Lj (j = 1..k) is called a CCL structure in the context of a program
F if there exists a partitioning of the program F, F.a = ∪j Lj.a (j = 1..k),
such that for any two groups of actions Lj, Lj+1 (j = 1..k-1), "Lj
safe-dependent Lj+1".

Definition 8 Partial-Order Layer (POL Structure). A CCL structure Lj
(j = 1..k) is called a POL structure in the context of a program F iff for Lj
the relation "mutual-dependence" is a transitive closure.
Lemma 1 CCL Progress/Safety Properties. If Lj (j = 1..k) is a POL
structure in the context of a program F, there exists a predicate LIj such
that:

a) "invariant LIj" holds in all consistent local states (also called a global
snapshot [12]) determined by the executions of the program Lj.

b) "LIj leads-to LIj+1" and "¬(LIj+1 co LIj)" hold for the program
resulting from the union of Lj and Lj+1.

Proof:

a) From definition 7, for any POL structure Lj (j = 1..k), ∃ predicate LIj such
that: <∀ s, t ∈ Lj.a , "s mutual-dependent t"> =

(Lj.init ⇒ LIj) ∧ ({LIj} F {LIj}) = invariant LIj.

b) For any POL structures Lj and Lj+1 (j = 1..k-1), ∃ predicate LIj such
that, if temporal-unilateral dependent,

"Lj safe-dependent Lj+1" ⇒ "LIj leads-to LIj+1",
or, if data-unilateral dependent,

"Lj safe-dependent Lj+1" ⇒ "LIj leads-to LIj+1".

The co property is proved by contradiction; we demonstrate that
"Lj safe-dependent Lj+1" ⇒ "¬(LIj+1 co LIj)".

Lemma 2 CCL-POL Transformation. Any CCL structure L can be represented
as a safe-dependent layer composition (denoted "•") of POL structures Li,
1 ≤ i ≤ m, with the following properties:

a) L = L1 • L2 • .. • Lm

b) ∃ invariant LI_L that holds in all consistent local states determined by
the executions of L.

Proof:

a) Apply definitions 7 and 8.

b) For example: invariant LI_L = invariant LI_L1 ∨ .. ∨ invariant LI_Lm

Lemma 3 Union-CCL Transformation. Given a program F as a union
composition, F = S1 [] S2 [] .. [] Sn, if there exists a partitioning of F,

F.a = ∪j Lj.a (j = 1..k), such that every Lj (j = 1..k) is a CCL structure,
then the union and the Communication Closed Layer compositions (denoted "•")
are equivalent:

F = S1 [] S2 [] .. [] Sn = L1 • L2 • .. • Lk

Discussion: The union composition can be transformed into an equivalent
Communication Closed Layer composition, provided that no mutual interlayer
"conflict" communications occur during the execution of any program Lj
(j = 1..k).
Therefore, the union composition F can be equivalently rewritten as a composi- 
tion of Communication Closed Layers, based on the causal order of the events. 

3 DUALITY Design Model 

DUALITY, our design model for parallel programs, divides the solution space 
into two synergistically coordinated representations of the specifications and 
mappings. 
(i) The problem-oriented specification, "What?", is an architecture-independent
representation of the computations. This is the premise for achieving
transparency for applications running in a parallel or distributed environment.
Correctness, safety and progress properties apply globally to the transparent
specification.

(ii) The mapping, "How? When? Where?", is an abstract description of how
to execute a specification on a target parallel or distributed architecture.
The mapping represents architectural details, e.g. parallelism constraints,
location awareness, and adaptation to interprocessor communication
constraints. Complexity and efficiency are major concerns of the mapping.

We introduce the Meta-Program as a consistent coordination of
architecture-independent specifications and efficient implementations. A
Meta-Program is a causal and concurrent (dual) composition of the adaptive
mapping and the transparent specification. This notion extends Chandy and
Misra's program-schema [1].

Controlling the effect of nondeterminism is one problem that affects the
design of parallel applications. In a Meta-Program there is a duality of causal
and concurrent relations between specification and mapping events. This fact is
due to the asynchronous character of the system components. However, one can
identify the duality of causal and concurrent relations between Meta-Program
events. Let us consider two agents, a1 and a2, sending the messages b1 and b2,
respectively, to the same host, h, in mailbox B. To represent the two actions
one uses labeled assignments, x: Bh := b1 and y: Bh := b2, respectively. The
mapping expresses how the assignments are executed. To illustrate, for a mobile
concurrent system, the mapping can represent channel availability abstractions,
mx: connect(a1,h) and my: connect(a2,h), respectively. The notation for the
nondeterministic interleaving of specification and mapping events is:

Concurrent(mx: connect(a1,h), my: connect(a2,h), x: Bh := b1, y: Bh := b2);
using a simplified notation:

Concurrent(mx, my, x, y).

Under the interleaving semantics [6], a single execution of the Meta-Program
is considered a totally ordered sequence of events. The semantics of the
Meta-Program is given by the set of all possible total-order executions:
(x my), (my x), (x mx), (mx x), (y my), (my y), (y mx), (mx y). The two total
orders of independent events, (x my) and (my x), are equivalent, and so are
the total orders (y mx) and (mx y). One calls the unordered sequence of the
events x and my a partial order. Under partial-order semantics, a single
execution is considered to be a partially ordered sequence of events. The
Meta-Program semantics is given by six partial-order executions: (x my, my x),
(x mx), (mx x), (y my), (my y), (y mx, mx y). Partial-order semantics
distinguishes between causality and concurrency and allows for a more
expressive specification. Therefore, one can determine equivalent sequences of
events. In this way, instead of analyzing all possible interleavings, we need
to consider only one representative from each equivalence class. This approach
has two benefits:
1. It allows for a simplified, less expensive and more efficient verification
of the system's properties.

2. It supports better decisions among the possible alternatives and,
consequently, a reduced design complexity.
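The reduction from interleavings to equivalence classes can be illustrated
with a small enumeration. The sketch below is an assumption-laden
illustration rather than the paper's procedure: it uses complete runs over the
four events (the text lists pairs), takes mx before x and my before y as the
causal order, and treats only the pairs {x, my} and {y, mx} as independent.
Two runs are equivalent when one is obtained from the other by swapping
adjacent independent events.

```python
from itertools import permutations

EVENTS = ("mx", "my", "x", "y")
CAUSAL = {("mx", "x"), ("my", "y")}  # assumed: first event must precede second
INDEPENDENT = {frozenset(("x", "my")), frozenset(("y", "mx"))}  # assumed


def respects_causality(seq):
    return all(seq.index(a) < seq.index(b) for a, b in CAUSAL)


def neighbours(seq):
    """Runs obtained by swapping one pair of adjacent independent events."""
    for i in range(len(seq) - 1):
        if frozenset((seq[i], seq[i + 1])) in INDEPENDENT:
            yield seq[:i] + (seq[i + 1], seq[i]) + seq[i + 2:]


def equivalence_classes(sequences):
    remaining, classes = set(sequences), []
    while remaining:
        start = remaining.pop()
        cls, frontier = {start}, [start]
        while frontier:
            for nxt in neighbours(frontier.pop()):
                if nxt in remaining:
                    remaining.discard(nxt)
                    cls.add(nxt)
                    frontier.append(nxt)
        classes.append(cls)
    return classes


runs = [p for p in permutations(EVENTS) if respects_causality(p)]
classes = equivalence_classes(runs)
print(len(runs), "interleavings,", len(classes), "equivalence classes")
```

Under these assumptions a verifier needs one representative per class rather
than every interleaving.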

The semantics of POL structures and CCL structures allows verification of the
properties of the parallel systems without exploring all possible interleavings
of nondeterministic executions. One benefit is the capacity to manipulate the
layer composition as a partial-order, mapping operator. Initially, the
Meta-Program is represented as:

MPs = Concurrent(mx, my, x, y) = mx [] my [] x [] y (1)

For all executions of the Meta-Program, there is a causal order between
the events mx and x, and between my and y. Provided that there is no
conflict between x and my, or between y and mx, the ordering between
non-conflicting events can be ignored. By applying definitions 5, 6, 7, 8 and
Lemma 2, one identifies two POL structures: the specification (2) and the
mapping (3).

S = Concurrent(x, y) = x [] y (2)

Ms = Concurrent(mx, my) = mx [] my (3)

The Meta-Program expresses that the two POL structures are in a safe-dependent,
causal relation:

MPs = Causal(Ms, S) (4)

The relation Concurrent(s,t) means that the actions s and t can happen in
any order, consequently: "s mutual-dependent t" = Concurrent(s,t). The relation
Causal(s,t) expresses the data or temporal causality between actions s and t,
therefore we can write: "s safe-dependent t" = Causal(s,t). From (2), (3) and
(4), the Meta-Program expression

MPs = Causal(Concurrent(mx, my), Concurrent(x, y)) can be denoted in
terms of layer and union composition:

Ms • S = (mx [] my) • (x [] y) (5)

This design strategy leads us to the algebraic law of Communication Closed
Layers, which allows for gradually adding details of temporal ordering to a
Meta-Program that is already proven correct.

Causal(Concurrent(mx, my), Concurrent(x, y)) =

Concurrent(Causal(mx, x), Causal(my, y))

Consequently we write an equivalent expression of the Meta-Program:

MPs = (mx [] my) • (x [] y) = (mx • x) [] (my • y) (6)

The algebraic law is reformulated (MPx and MPy are Meta-Program components)
in (7):

MPs = MPx [] MPy = Ms • S (7)

To generalize, let us assume an architecture-independent, transparent
specification:

S = Concurrent(S1, .., Sn) = S1 [] .. [] Sn (8)

A possible mapping to a shared-memory parallel architecture is a union of n
communication channels, denoted

Ms = Concurrent(Ms1, .., Msn) = (Ms1 [] .. [] Msn). (9)

In a Meta-Program, which describes a parallel application, the mapping compo- 
nent is an adaptive layer that causally coordinates the events of the specification 
layer. The mapping program acts as a regulator that controls the specification. 
It can delay or inhibit basic interactions (assignments) or it can reallow them, 
in accordance with implementation constraints. The Meta-Program expression 
to reflect this idea is given by (10). 

MPs = Causal(Concurrent(Ms1, .., Msn), Concurrent(S1, .., Sn)) (10)

We formulate, in terms of UNITY, the generalized Communication Closed 
Layers algebraic law [7, 9], to allow transformations of the Meta-Program in 
forms that match shared memory or distributed parallel architectures. Each 
transformation step is carried out in conjunction with a formal verification phase, 
to increase confidence in the design correctness and to allow earlier error de- 
tection. This method is highly applicable in the initial stages of the design. 
Given CCL structures Lij and Lkl, if there is a causal dependence between Lij
and Lkl, then either i=k or j=l is satisfied, for 1 ≤ i, j, k, l ≤ n. The
generalized algebraic law of Communication Closed Layers is formulated as
Concurrent(Causal(L11, .., L1m), .., Causal(Ln1, .., Lnm)) =

Causal(Concurrent(L11, .., Ln1), .., Concurrent(L1m, .., Lnm))
or, in an equivalent notation in terms of layer and union compositions:

(L11 • .. • L1m) [] .. [] (Ln1 • .. • Lnm) = (L11 [] .. [] Ln1) • .. • (L1m [] .. [] Lnm) (11)

Applying (11) to the Meta-Program MPs (10), we obtain the DUALITY
principle in practice:

MPs = Causal(Ms, S) =

Causal(Concurrent(Ms1, .., Msn), Concurrent(S1, .., Sn)) =

Concurrent(Causal(Ms1, S1), .., Causal(Msn, Sn)) =

Concurrent(MPs1, .., MPsn)

The DUALITY principle formulated in terms of layer and union composition:

MPs = Ms • S and

(Ms1 [] .. [] Msn) • (S1 [] .. [] Sn) = (Ms1 • S1) [] .. [] (Msn • Sn) = MPs1 [] .. [] MPsn (12)
By transforming the initial representation of the Meta-Program (4), we
coordinate parallel architectural constraints with specific clusters of actions
(assignments). Consistent coordination implies that if the
architecture-independent specification of an application is Si, the
corresponding parallel application, MPsi, can reuse Si's design knowledge.
DUALITY supports the derivation of consistent parallel implementations from
program specifications. Consequently, transparent and architecture-independent
specifications can be transformed efficiently into forms that match particular
target architectures. Our approach relates data and process parallelism and
aims to reuse design knowledge from classes of sequential design patterns;
therefore, it is the basis for reducing the overall complexity. The correctness
of the Meta-Program is deduced from the correctness of the
architecture-independent specifications, preserved at each refinement step.
More specifically, the correctness of the Meta-Program is inferred from the
individually identified program modules: Groups-of-Actions, POL structures or
CCL structures.



Lemma 4 Safety of Duality. If invariant I holds for the union (MP1 [] MP2)
and the same invariant I holds for the layer composition (Ms • S), then the two
forms are input/output equivalent: MP1 [] MP2 = Ms • S.



Lemma 5 Progress-Safety of Duality. Given MPs = Ms • S, if invariant I_Ms
holds for Ms and invariant I_S holds for S, then the formula (invariant I_Ms)
leads-to (invariant I_S) holds for the Meta-Program MPs.

Discussion: I_Ms is an assertion on the global snapshot associated with the
layer Ms. Likewise, I_S is an assertion on the global snapshot associated with
the layer S. No matter how the Meta-Program evolves, if I_Ms is possible at some
time, then I_S is also possible at a later time. This property applies to the
safe-dependent composition of the mapping and the specification, i.e. to the
Meta-Program MPs. It supports a specification-consistent implementation model
for a parallel program, based on the duality of causality and concurrency.

Related work has investigated manual proofs for Communication Closed Lay- 
ers [3, 12] and mechanized verification based on partial-order model checkers. 
Partial global snapshots in UNITY have been suggested in [10] as recording 
predicates that hold at some point in the computation. 

4 Example of Formal Transformation and Design Reuse 

To illustrate, we use the Floyd-Warshall algorithm [13], which computes the
shortest distance between all pairs of nodes.

Notations: Quantified compositions,

<[] i : 1 ≤ i ≤ n :: Si> = S1 [] .. [] Sn - union composition

<|| i : 1 ≤ i ≤ n :: Si> = S1 || .. || Sn - synchronous parallel composition

<• i : 1 ≤ i ≤ n :: Li> = L1 • .. • Ln - layer composition
(I) Initial solution - an architecture-independent, transparent
specification. A model of the program is a weighted, directed graph, G, with n
nodes. The weight of the edge connecting nodes t and u is denoted by w[t,u]. If
nodes are not connected, w[t,u] = ∞. Fig. 2 shows the initial specification
(t, u, v - nodes and d[t,u] - distance between nodes t and u). We address the
issue of correctness by expressing the safety and the progress properties and
by verifying them against the initial specification.

invariant(d[t,u] is the shortest path from node t to node u) holds for F1.



Program F1

initially <|| t,u :: d[t,u] = w[t,u]>

assign <[] t,u,v : 1 ≤ t,u,v ≤ n :: d[t,u] := min(d[t,u], d[t,v]+d[v,u])>
end

Fig. 2. Program F1 (Floyd-Warshall) - architecture-independent specification



(II) Transformations of the initial solution based on causal and tempo- 
ral order. We transform (see Lemma 3) a union composition into a functionally 
equivalent composition of CCL structures. The quantified layered representation 
is given by program F2 (Fig. 3). 



Program F2
assign

<[] t,u : 1 ≤ t,u ≤ n :: d[t,u] = w[t,u]> •

<• v : 1 ≤ v ≤ n :: <[] t,u : 1 ≤ t,u ≤ n :: d[t,u] := min(d[t,u], d[t,v]+d[v,u])>>
end

Fig. 3. Program F2 - the transformation into CCL structures



Next we explain why this solution can be more efficiently mapped to an 
asynchronous shared memory architecture or to a static distributed architecture. 
We need to introduce some notations. 

L0 = <[] t,u : 1 ≤ t,u ≤ n :: d[t,u] = w[t,u]> is the initial CCL structure. L0
initializes the shortest distance between any nodes t and u. Qtu is the
notation for d[t,u] = w[t,u], therefore L0 = <[] t,u : 1 ≤ t,u ≤ n :: Qtu>.

For any 1 ≤ i ≤ n, Li = <[] t,u : 1 ≤ t,u ≤ n :: d[t,u] := min(d[t,u], d[t,i]+d[i,u])>
is a CCL structure, which calculates the shortest distance between any nodes t
and u through an intermediate node i.

For the assignment d[t,u] := min(d[t,u], d[t,i]+d[i,u]) we use the notation
P^i_tu and obtain Li = <[] t,u : 1 ≤ t,u ≤ n :: P^i_tu>, where 1 ≤ i ≤ n.
Applying Lemma 3, the specification is rewritten as a layer composition of CCL
structures:

F2 = L0 • L1 • .. • Ln (13)
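The claimed equivalence between the union form F1 and the layered form F2 can
be tried out on a small graph. The following Python sketch is only an
illustration under stated assumptions: the function names, the random
scheduling and the 4-node example graph are hypothetical, edge weights are
assumed non-negative, and w[t,t] is assumed to be 0. In run_union any
assignment may fire until a fixed point is reached; in run_layers the layers
are executed in order, with an arbitrary order inside each layer.

```python
import random
from itertools import product

INF = float("inf")


def run_union(w, seed=0):
    """F1: fire the (t, u, v) assignments in random order until nothing changes."""
    n, rng = len(w), random.Random(seed)
    d = [row[:] for row in w]
    triples = list(product(range(n), repeat=3))
    changed = True
    while changed:
        changed = False
        rng.shuffle(triples)
        for t, u, v in triples:
            if d[t][v] + d[v][u] < d[t][u]:
                d[t][u] = d[t][v] + d[v][u]
                changed = True
    return d


def run_layers(w, seed=0):
    """F2: layer L0 (initialisation) followed by layers L1..Ln in order."""
    n, rng = len(w), random.Random(seed)
    d = [row[:] for row in w]                      # layer L0
    for i in range(n):                             # layers L1..Ln
        pairs = list(product(range(n), repeat=2))
        rng.shuffle(pairs)                         # order inside a layer is free
        for t, u in pairs:
            d[t][u] = min(d[t][u], d[t][i] + d[i][u])
    return d


# Hypothetical example graph (INF marks a missing edge).
W = [
    [0, 3, INF, 7],
    [8, 0, 2, INF],
    [5, INF, 0, 1],
    [2, INF, INF, 0],
]

dist_union = run_union(W)
dist_layers = run_layers(W)
```

Both runs end in the same distance matrix, although the layered form executes
each assignment exactly once.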
Ii = invariant(d[t,u] is the shortest path from t to u through i).
Ii is an assertion about the global virtual state associated with the CCL
structure Li. By applying Lemma 5 and transitivity, we prove that the formula
"Ik-1 leads-to Ik" holds for any 1 ≤ k ≤ n.



(III) Transformations of the layered solution to match a distributed
architecture, preserving safety and progress properties. From (13) and
using the notations defined previously (P^i_tu denotes the layer-i assignment
for the pair t, u), the program F2 is rewritten:

F2 = <[] t,u : 1 ≤ t,u ≤ n :: Qtu> •

<[] t,u : 1 ≤ t,u ≤ n :: P^1_tu> • .. •

<[] t,u : 1 ≤ t,u ≤ n :: P^n_tu>

By explicitly representing the quantified expressions we obtain:

F2 = (Q11 [] .. [] Qnn) • (P^1_11 [] .. [] P^1_nn) • .. • (P^n_11 [] .. [] P^n_nn) (14)

Form (14) applies to asynchronous shared-memory architectures. We can take
advantage of the minimal communication conflict closed ordering. Thus, if we
apply the general algebraic laws of CCL to the program F2 we obtain:

F2 = (Q11 • P^1_11 • .. • P^n_11) [] (Q12 • P^1_12 • .. • P^n_12) [] .. [] (Qnn • P^1_nn • .. • P^n_nn) (15)

Form (15) is more appropriate for implementation on distributed architec- 
tures. 



(IV) The Meta-Program MPF2 - a coordinated composition of the
specification (F2) with a mapping program to a distributed
architecture. MPF2 = Causal(MF2, F2) = MF2 • F2

To illustrate, the mapping program (MF2) stipulates an initial number of
implementation details, which are related to communication channels.

MF2 = M11 [] M12 [] .. [] Mnn

The notation Mtu stands for a mapping program component that represents
the actions on the communication channel connecting the nodes t and u.
The Meta-Program representation is given by the CCL composition of the mapping
program with the specification F2 (13), provided in a form that matches a
distributed architecture (15).

MPF2 = (M11 [] M12 [] .. [] Mnn) • (L0 • L1 • .. • Ln)

MPF2 =

(M11 [] M12 [] .. [] Mnn) •

((Q11 • P^1_11 • .. • P^n_11) [] (Q12 • P^1_12 • .. • P^n_12) [] .. [] (Qnn • P^1_nn • .. • P^n_nn))

By applying the generalized laws of CCL to the Meta-Program MPF2 we
obtain:

MPF2 =

(M11 • (Q11 • P^1_11 • .. • P^n_11)) []
(M12 • (Q12 • P^1_12 • .. • P^n_12)) [] .. []

(Mnn • (Qnn • P^1_nn • .. • P^n_nn))
We distribute the mapping component over all program specification components
and obtain the quantified representation of the Meta-Program MPF2:

MPF2 = <[] t,u : 1 ≤ t,u ≤ n :: (Mtu • Qtu) • <• k : 1 ≤ k ≤ n :: Mtu • P^k_tu>> (16)

The DUALITY principle applied to (16) indicates a consistent decomposition of
the parallel program, coordinating the specification and mapping components.

MPF2 = MP11 [] MP12 [] .. [] MPnn (17)

The Meta-Program MPF2 is a causally coordinated composition of mapping
and specification events. The specification is transparent and
architecture-independent. The mapping deals with the implementation details of
the parallelism. This separation of concerns gives more modularity. It helps in
systematically identifying architectural and behavioral patterns, an aspect
that can become the basis for a reduced design cost and less complexity.



(V) The verification of the coordinated Meta-Program. We can infer the
safety and progress properties of the Meta-Program MPF2 by causally composing
the data and temporal properties of the mapping and the specification, and by
applying Lemma 4, Lemma 5 and the generalized algebraic laws of CCL.

5 Summary and Future Work 

This paper describes DUALITY, a design model for parallel and distributed
applications, based on the systemic duality of causality and concurrency.
Compared to the data-parallel model [2], DUALITY acts as a
specification-consistent coordination model for parallel program executions. It
specifies temporal and causal ordering of events for both process and data
parallelism. We integrate the simplicity of UNITY's notation and logic with the
modular structure of the Communication Closed Layers. In this way, we provide a
more structured style of parallel programming, while stressing the separation
of concerns of correctness and efficiency. The correctness of parallel programs
should not rely on implementation details related to a certain target machine
or on issues of some particular programming language or operating system. By
introducing DUALITY, the correctness of parallel and distributed systems is
addressed in a systematic manner with each transformation of the specification.
Applicable to the initial phases of the design, the DUALITY principle supports
a simple formalism that, in the end, can be mapped onto real programming
languages. This approach leads to fewer, but more relevant, design alternatives
and therefore to less design complexity. Our definition of Meta-Program extends
UNITY's notion of program-schema. The Meta-Program reflects the duality of
concurrency and causality relations between events in a parallel system.
DUALITY allows the reuse of architectural patterns and proofs of sequential
algorithms in the design of parallel programs.

We are currently working to extend the formalism of Communication Closed
Layers composition, defined in terms of UNITY and branching temporal logic.
Ongoing work integrates a partial-order model checker [6] to provide
mechanized verification for specifications and mappings of parallel and
distributed programs.
Abstract. In the framework of self-stabilizing systems, the convergence
proof is generally done by exhibiting a measure that strictly decreases
until a legitimate configuration is reached. The discovery of such a measure
is very specific and requires a deep understanding of the studied
transition system. In contrast, we propose here a simple method for proving
convergence, which regards self-stabilizing systems as string rewrite
systems, and adapts a procedure initially designed by Dershowitz for
proving termination of string rewrite systems.



1 Introduction 

Introduced by Dijkstra with three mutual exclusion algorithms on a ring of
processes [10], the notion of self-stabilization has been widely studied over
the last ten years (see [28,30] for surveys). In this paper, we consider a
system which consists of a ring of machines controlled by a "central demon".
Its configuration is the concatenation of the components' local states, and it
is characterized by a set R of transitions defined over configurations. The
system is self-stabilizing with respect to a subset L of legitimate
configurations when, regardless of the initial configuration and regardless of
the transition selected each time by the central demon, it is guaranteed to
arrive in a configuration of L within a finite number of steps. The set L is
assumed to have a closure property: from a legitimate configuration in L, the
system persistently remains in L. It is also frequent to assume that there is
no deadlock. With these two hypotheses, it is easy to show that a system is
self-stabilizing iff it has the no-cycle property: there is no cyclic sequence
of transitions which contains some configuration w ∉ L. This property is often
shown by exhibiting a norm function defined over the set of configurations,
whose value strictly decreases after each transition (or each bounded sequence
of transitions) as long as the configuration is not legitimate [30]. Since
such a measure is usually very specific to the considered system, finding one
is very difficult and requires a deep understanding of this system (see e.g.
[22,13,4]).

We propose here a new approach for proving the absence of cycles. Configurations
are viewed as words of a formal language, transitions of R as rewrite rules,
and the no-cycle property as a variant of the nontermination property for rewrite
rules. More precisely, we will attack the no-cycle property under the following
equivalent form: there is no infinite sequence of transitions of $R^{-1}$ (the reverse
system obtained from R by switching origin and target configurations) starting
from a configuration $w \notin L$. The absence of such infinite sequences will then
be shown by refining the generation procedure of reduction chains (also called
forward closures in [8]), first proposed by Dershowitz for proving string rewriting
termination [7]. The method proposed here is new in that: 1) it uses a general
technique of string rewriting to deal with self-stabilization; 2) instead of working
with the direct rewriting relation from arbitrary configurations towards the
legitimate ones, it uses the inverse relation; 3) it does not consider all the possible
rewrite derivations, but only “representative” ones, using a restricted rewriting
strategy.

Related work on self-stabilization proofs. Dijkstra's paper [10] gives no proofs.
In [11], a correctness proof is given for the third (3-state) algorithm of [10], by
showing properties of executions using behavioral reasoning. As already pointed out,
almost all further proof methods (cf. [30]) are based on norm functions (see an
example in [18]) but, as expressed by Gouda [14]: “It has been my experience that
the ratio of time to design a stabilizing system to the time to verify its
stabilization is about one to ten”. To simplify the proof process, general paradigms
have been proposed: various protocol compositions making proofs modular ([3,15]),
attractor or staircase methods ([14,28]), automatic transformations into stabilizing
systems [2]. Some ideas in [17] are also used in our approach, as discussed
in Section 7. Note that, recently, some work has been done on proving convergence
without appealing to a norm function: [31] uses techniques borrowed from control
theory and [1] uses induction techniques over the set of configurations.

Related work on rewrite techniques applied to distributed systems. Although
viewing transitions as rewrite rules is rather natural, the application of
general rewrite techniques for proving properties of distributed systems has not
been explored to our knowledge, except in [24,25,26] where graph rewriting
techniques (with priorities) are used to prove the correctness of various distributed
algorithms (election, spanning-tree construction, ...). However this work does not
address the issue of self-stabilization.

Plan of the paper. Section 2 gives an intuitive presentation of the method with
the example of Ghosh’s 4-state algorithm. Section 3 shows how self-stabilization
can be viewed as a property of string rewriting systems. Our basic method and
the underlying correctness result are explained in Section 4, then refined in
Section 5. Section 6 relates compositionality results in the field of self-stabilization
and in the field of rewriting. Section 7 concludes with final remarks and
perspectives.

2 An Illustrating Example: Ghosh’s 4-State Algorithm

Ghosh’s algorithm is a variant of Dijkstra’s 4-state self-stabilizing algorithm [13].
The system consists of a parametric number N of machines ($0, 1, \ldots, N-1$),
which have four states: $\{0,1,2,3\}$, except the top machine $N-1$ (resp. bottom
machine 0) which has only two states: $\{0,2\}$ (resp. $\{1,3\}$). The configuration of
the system is the string of all machine states, delimited by special end symbols ‘#’.
Writing X, Y for nonempty string variables, the transitions correspond to
the following system $R = Middle \cup Top \cup Bottom$ of rewrite rules.

- Middle is made of rules of the form:

$M_1$: $\#X(q+1)qY\# \to \#X(q+1)(q+1)Y\#$

$M_2$: $\#Xq(q+1)Y\# \to \#X(q+1)(q+1)Y\#$

where $q \in \{0,1,2,3\}$ and $+$ is addition modulo 4.

- Top is made of rules of the form:

$T_1$: $\#X32\# \to \#X30\#$ and $T_2$: $\#X10\# \to \#X12\#$

- Bottom is made of rules of the form:

$B_1$: $\#12X\# \to \#32X\#$ and $B_2$: $\#30X\# \to \#10X\#$

Ghosh proves the convergence by considering a norm function (Br, Ds) such
that either Br or Ds strictly decreases each time a transition is done. Individually,
Br and Ds are non-increasing functions. Br is the number of breaks, i.e.
the number of neighbouring states q, q' of the string which differ by at least one
unit. Ds is a very subtle function measuring the sum of distances between pairs
of neighbouring breaks of the string. All the difficulty of Ghosh’s proof comes
from the discovery of such a measure Ds. In contrast, we now informally explain
how our method proves the convergence, with the help of measure Br only. Our
first idea is to consider not the original system R but the inverse $R^{-1}$ obtained
by switching lefthand and righthand sides. Since Br is non-increasing via R, it
is non-decreasing via $R^{-1}$. Our second idea is to prove convergence by showing
that there is no infinite derivation via $R^{-1}$ (except those among legitimate
configurations, which are not displayed here for lack of space). We will focus
on Br-preserving infinite derivations, i.e. infinite derivations which preserve the
number of breaks: since Br is non-decreasing and bounded by the total number
N of system components, any infinite derivation has an infinite Br-preserving
suffix, so there is no loss of generality. For instance, the two following reductions
(which rewrite the second 1 in the first case, and a 2 inside the block of 2's in
the second case) are discarded because they increase Br:

$\#1123(0123)\cdots(0123)2\cdots2\# \to \#1023(0123)\cdots(0123)2\cdots2\#$

$\#1123(0123)\cdots(0123)2\cdots222\cdots2\# \to \#1123(0123)\cdots(0123)2\cdots212\cdots2\#$
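The rules of R and the measure Br can be sketched in a few lines of Python. This is only an illustrative encoding, not part of Ghosh's or our proof: the names `breaks` and `successors` are hypothetical, configurations are written without the ‘#’ delimiters, and we assume (our reading of the rule shapes) that a middle rule never changes the state of the bottom or top machine.

```python
def breaks(cfg: str) -> int:
    """Br: number of neighbouring positions that hold different states."""
    return sum(1 for a, b in zip(cfg, cfg[1:]) if a != b)


def successors(cfg: str) -> set:
    """Configurations reachable from cfg by one rule of R (assumed encoding)."""
    n = len(cfg)
    s = [int(c) for c in cfg]
    out = set()

    def emit(i: int, value: int) -> None:
        t = s.copy()
        t[i] = value
        out.add("".join(map(str, t)))

    for i in range(n - 1):
        left, right = s[i], s[i + 1]
        # M1: (q+1) q -> (q+1)(q+1); machine i+1 changes, so it must not be the top machine.
        if left == (right + 1) % 4 and i + 1 <= n - 2:
            emit(i + 1, left)
        # M2: q (q+1) -> (q+1)(q+1); machine i changes, so it must not be the bottom machine.
        if right == (left + 1) % 4 and i >= 1:
            emit(i, right)
    if cfg.endswith("32"):
        emit(n - 1, 0)  # T1
    if cfg.endswith("10"):
        emit(n - 1, 2)  # T2
    if cfg.startswith("12"):
        emit(0, 3)      # B1
    if cfg.startswith("30"):
        emit(0, 1)      # B2
    return out
```

Under this encoding, every step of R leaves Br unchanged or lowers it, which is the property the argument above relies on; the first discarded reduction corresponds to the R-step from a word starting with 1023 to the word starting with 1123.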
Our third and main idea is to generate not all the possible Br-preserving derivations
via $R^{-1}$ but only “representative” ones, in the sense that an infinite derivation
exists iff an infinite representative one exists. As explained later, representative
derivations are obtained by reasoning with first-order variables that represent
arbitrary sets of strings, and instantiating these variables in a “minimal” way
through successive steps. For pedagogical reasons, we first assume that appropriate
instantiations are known from the beginning, and present the treatment
of Ghosh’s algorithm, starting from a string $s_0$ with all its variables already
instantiated.

Generation of representative strings.

We start from:

$s_0$: $\#\pi(0123)\cdots(0123)3\cdots30\#$, where $\pi$ is either 3 or 123. Now $s_0$ reduces
via $T_1^{-1}$ to

$s_1$: $\#\pi(0123)\cdots(0123)3\cdots332\#$, which reduces via a (possibly empty) sequence
of $M_1^{-1}$ to

$s_2$: $\#\pi(0123)\cdots(0123)3\cdots32\cdots2\#$.

At some point, one generates

$s_3$: $\#\pi(0123)\cdots(0123)(0123)32\cdots2\#$, which reduces via $M_2^{-1}$ successively to

$s_4$: $\#\pi(0123)\cdots(0123)(01223)2\cdots2\#$,

$s_5$: $\#\pi(0123)\cdots(0123)(01123)2\cdots2\#$,

$s_6$: $\#\pi(0123)\cdots(0123)(00123)2\cdots2\#$,

$s_7$: $\#\pi(0123)\cdots(0123)(30123)2\cdots2\#$.

A generalized form for $s_3$ and $s_7$ is

$s_8$: $\#\pi(0123)\cdots(0123)(01233)(0123)\cdots(0123)2\cdots2\#$,

which contains a substring (01233) between a left and a right sequence of (0123)’s.

$s_8$ rewrites via $M_2^{-1}$ successively to

$s_9$: $\#\pi(0123)\cdots(0123)(01223)(0123)\cdots(0123)2\cdots2\#$,

$s_{10}$: $\#\pi(0123)\cdots(0123)(01123)(0123)\cdots(0123)2\cdots2\#$,

$s_{11}$: $\#\pi(0123)\cdots(0123)(00123)(0123)\cdots(0123)2\cdots2\#$,

$s_{12}$: $\#\pi(0123)\cdots(0123)(30123)(0123)\cdots(0123)2\cdots2\#$,

which can be seen as a copy of $s_8$ where an element (0123) of the left sequence has
moved to the right sequence. Such an iterated application of $M_2^{-1}$ always ends when
the left sequence is empty, thus yielding:

$s_{end}$: $\#\pi(30123)(0123)\cdots(0123)2\cdots2\#$.

In case $\pi$ is 3 there is no possible Br-preserving rule application.

In case $\pi$ is 123, there are two ultimate applications of $M_2^{-1}$ yielding successively:

$s'_{end}$: $\#1223(0123)(0123)\cdots(0123)2\cdots2\#$,

$s''_{end}$: $\#1123(0123)(0123)\cdots(0123)2\cdots2\#$,

with no Br-preserving rule applicable to $s''_{end}$.

This result shows that there is no infinite Br-preserving derivation starting
from $s_0$. It remains to explain how the initial string $s_0$ was inferred. The basic
idea is to reason on strings with variables at the first-order level. Variables are
instantiated in a “minimal” manner so that reduction rules become applicable.
We start with the lefthand side $\#X30\#$ of top rule $T_1^{-1}$, which is reduced to
$\#X32\#$. Then X is bound to $X_1 3$, and instance $\#X_1 332\#$ reduces via $M_1^{-1}$ to
$\#X_1 322\#$. This operation is iterated, yielding $\#X_n 3 2^n 2\#$. Then $X_n$ is
instantiated with $Y3$, but now $M_2^{-1}$ is applied, which yields $\#Y 2 3 2^n 2\#$. Then Y is
instantiated with $Z2$, and application of $M_2^{-1}$ yields $\#Z 1 2 3 2^n 2\#$. Iterating this
operation generates strings of the form $\#W_1(0123)2^n 2\#$, $\ldots$, $\#W_m(0123)^m 2^n 2\#$,
$\ldots$. Similarly, another representative string derivation can be obtained, starting
from the other top rule righthand side $\#X12\#$.

As a recapitulation, our strategy consists in generating “representative” chains
of derivations by starting from righthand sides of top rules and applying
Br-preserving rules (from right to left). One then shows that only finite derivations
are obtained (except over legitimate configurations). The finiteness of derivations
will be proved formally by finding a generic set $\Pi$ of “growing patterns”, i.e. a
set of strings such that any Br-preserving rule applied to any element $t \in \Pi$
yields an element $u \in \Pi$ greater than t, for some ordering > over $\Pi$. In the case
of Ghosh’s algorithm, $\Pi$ is of the form: $\{\#X\pi'(0123)^m 2^n 2\# \mid m \geq 0, n \geq 0\}$
where $\pi'$ is 3, 23, 123 or 0123. The associated relation > merely orders elements
of $\Pi$ according to the number m of substrings (0123) they contain. (m is increased
by one each time four rules are applied consecutively.) Our underlying
claim is that such a strictly monotonic ordering > is much simpler to find than
Ghosh’s strictly monotonic norm (Br, Ds) because we focus on “representative”
derivations using a restricted strategy of rule application, instead of considering
all the possible configurations. The rest of the paper is devoted to the formal
description of our strategy and a proof of its correctness.

3 Self-Stabilizing Systems as String Rewrite Systems 

We first recall some basic definitions from (string) rewrite systems ([9], [5]). The
words considered here are generally delimited by a leftmost and rightmost special
symbol ‘#’. The symbols appearing between them belong to a finite alphabet $\Sigma$
or a set $V = \{W, X, Y, \ldots\}$ of variable symbols. A string is an element of $\Sigma^*$, with
$\varepsilon$ for the empty string. A ground word is an element of $\#\Sigma^*\#$ and an (open)
word is an element of $\#(\Sigma \cup V)^*\#$. A substitution is a mapping $\theta$ from V to
$(\Sigma \cup V)^*$ with $\theta(W) = W$ almost everywhere except on a finite set of variables
denoted $Dom(\theta)$. A substitution $\theta$ is represented by a finite set of pairs of the
form $\{W/\theta(W)\}_{W \in Dom(\theta)}$. A substitution $\theta$ is ground when $\theta(W)$ is in $\Sigma^*$, for
all $W \in Dom(\theta)$.



String Rewrite Systems. The string rewrite systems considered here contain
length-preserving rules, divided into three subsets: top rules in $Top_S$ are applied
to the rightmost part of words; bottom rules in $Bottom_S$ are applied to the
leftmost part of words (or simultaneously at both ends); the rest of rules in
$Middle_S$ are called middle rules. More precisely, let $\ell$, $r$ (resp. $\ell_i$, $r_i$ for i = 1, 2)
be nonempty strings of $\Sigma^*$ of the same length, and X, Y variables,

- $Middle_S$ is made of rules of the form: $\#X\ell Y\# \to \#XrY\#$

- $Top_S$ is made of rules of the form: $\#X\ell\# \to \#Xr\#$

- $Bottom_S$ is made of rules of the form:

$\#\ell X\# \to \#rX\#$ or $\#\ell_1 X \ell_2\# \to \#r_1 X r_2\#$.

We are going to apply these rules either to ground words or to words of the form
$\#uWv\#$ where u, v denote strings over $\Sigma^*$, and W a variable.

Example. (The following rules are the inverses of $B_1$, $T_4$, $M_1$, $M_4$ from the
Beauquier-Debas algorithm, see section 5.)

$B_1^{-1}$: $\#1X2\# \to \#2X1\#$ is a bottom rule. $T_4^{-1}$: $\#X21\# \to \#X12\#$
is a top rule. $M_1^{-1}$: $\#X01Y\# \to \#X10Y\#$ and $M_4^{-1}$: $\#X20Y\# \to \#X02Y\#$
are middle rules.

Ground Reduction. A ground word w is reducible via a rule of the form
$\#X\ell Y\# \to \#XrY\#$ if $w = \#u\ell v\#$ for some strings $u, v \in \Sigma^*$. One also
says that w is an instance of the rule lefthand side via the ground substitution
$\{X/u, Y/v\}$. The reduced form of w is $w' = \#urv\#$. Reduction via a rule of the
form $\#\ell_1 X \ell_2\# \to \#r_1 X r_2\#$ is defined in a similar way.

A word w reduces to w' via S, written $w \to_S w'$ (or sometimes simply
$w \to w'$), if w' is the reduced form of w via some rule of S. We say that S is
non terminating iff there exists an infinite sequence of reductions via S starting
from some ground word w. Otherwise, S is said to be terminating.

Example. The word w: $\#1012\#$ reduces via rule $B_1^{-1}$: $\#1X2\# \to \#2X1\#$
to w': $\#2011\#$, using the substitution $\{X/01\}$.
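Ground reduction is mechanical, as the following Python sketch suggests. The tuple encoding of rules and the function name `reducts` are our own illustrative choices (they do not come from [7] or [9]), and side conditions on variables, such as Y being nonempty, are deliberately not modelled here.

```python
def reducts(word: str, rule: tuple) -> set:
    """Reduced forms of a ground word (written with its '#' delimiters) via one rule.

    Assumed rule encodings:
      ("middle", l, r)          for  #X l Y#   -> #X r Y#
      ("top", l, r)             for  #X l#     -> #X r#
      ("bottom", l, r)          for  #l X#     -> #r X#
      ("ends", l1, l2, r1, r2)  for  #l1 X l2# -> #r1 X r2#
    """
    body = word[1:-1]
    kind = rule[0]
    out = set()
    if kind == "middle":
        _, l, r = rule
        # Every occurrence of l gives one ground substitution {X/u, Y/v}.
        for i in range(len(body) - len(l) + 1):
            if body[i:i + len(l)] == l:
                out.add("#" + body[:i] + r + body[i + len(l):] + "#")
    elif kind == "top":
        _, l, r = rule
        if body.endswith(l):
            out.add("#" + body[:len(body) - len(l)] + r + "#")
    elif kind == "bottom":
        _, l, r = rule
        if body.startswith(l):
            out.add("#" + r + body[len(l):] + "#")
    elif kind == "ends":
        _, l1, l2, r1, r2 = rule
        if (len(body) >= len(l1) + len(l2)
                and body.startswith(l1) and body.endswith(l2)):
            out.add("#" + r1 + body[len(l1):len(body) - len(l2)] + r2 + "#")
    return out
```

For instance, `reducts("#1012#", ("ends", "1", "2", "2", "1"))` reproduces the reduction of the example above.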



Self-Stabilization. We are now able to give a formal definition of self-stabilization
for a system modeled as a string rewrite system R. From now on, configurations
are regarded as ground words. Writing $\mathcal{L}_N$ for the set of legitimate
configurations in a system with N machines, we define the global set of legitimate
configurations as $\mathcal{L} = \bigcup_{N \geq 2} \mathcal{L}_N$.

Definition 1. A rewrite system R is self-stabilizing with respect to set $\mathcal{L}$ iff:

(0) Each ground word is reducible via R.

(1) $\mathcal{L}$ is closed via R, i.e.: $w \in \mathcal{L} \wedge w \to_R w' \Rightarrow w' \in \mathcal{L}$, for all ground
words w, w'.

(2) There is no ground cyclic derivation of the form $w_1 \to_R \cdots \to_R w_n = w_1$
with $w_1 \notin \mathcal{L}$.

Statement (0) expresses a no-deadlock property, (1) a closure property for
$\mathcal{L}$, and (2) a no-cycle property. It easily follows from this definition that any
“maximal” derivation is infinite and reaches the set $\mathcal{L}$: this corresponds to a
convergence property (also called no-livelock property in [6]).

Assuming (1), we have two equivalent versions for (2):

(2') There is no ground cyclic derivation via $S = R^{-1}$ of the form
$w_1 \to_S \cdots \to_S w_n = w_1$ with $w_1 \notin \mathcal{L}$.

(2'') There is no infinite ground derivation via $S = R^{-1}$ starting from a word
$w \notin \mathcal{L}$.

Proof of (2) $\Leftrightarrow$ (2') $\Leftrightarrow$ (2''). One has clearly (2) $\Leftrightarrow$ (2') and (2'') $\Rightarrow$ (2'). Let us
show (2') $\Rightarrow$ (2'') by showing $\neg$(2'') $\Rightarrow$ $\neg$(2'). Suppose $\neg$(2''): there is an infinite
ground derivation $\Delta$ via S: $u_1 \to_S u_2 \to_S \cdots \to_S u_i \to_S \cdots$ with $u_1 \notin \mathcal{L}$. Since S is
length-preserving and the alphabet is finite, only finitely many words can occur in
$\Delta$, so there should be a cycle in this derivation: there is an initial
part of $\Delta$ which is of the form $u_1 \to_S \cdots \to_S u_j \to_S \cdots \to_S u_n = u_j$ for
some j and n. Since $u_1 \to_S^* u_j$, we have $u_j \to_R^* u_1$. It follows that $u_j \notin \mathcal{L}$ because,
otherwise, $u_1$ would be itself in $\mathcal{L}$ by assumption (1) of closure. Therefore the
subpart $u_j \to_S \cdots \to_S u_n = u_j$ is a cycle with $u_j \notin \mathcal{L}$. One has thus exhibited
a cycle via S containing an element not in $\mathcal{L}$, which proves $\neg$(2').

Remark 2. Note that equivalence between (2) and (2'') does not hold any longer
if R is used instead of $R^{-1}$ within (2'').
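Remark 2 can be illustrated on a toy system of our own (it is not taken from the literature): two words a and b, transitions a → b and b → b, and b as the only legitimate word. The Python sketch below, with hypothetical helper names, uses the fact that in a finite system an infinite derivation from a word exists exactly when a cycle is reachable from it.

```python
def has_infinite_derivation(start, step):
    """True iff an infinite derivation starts at `start` in the finite relation `step`.

    `step` maps a word to the set of its one-step reducts; a depth-first search
    looks for a cycle reachable from `start`.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}

    def visit(w):
        colour[w] = GREY
        for nxt in step.get(w, ()):
            c = colour.get(nxt, WHITE)
            if c == GREY or (c == WHITE and visit(nxt)):
                return True  # a word on the current path is reached again
        colour[w] = BLACK
        return False

    return visit(start)


def inverse(step):
    """Reverse relation: t -> w in the result iff w -> t in `step`."""
    inv = {}
    for w, targets in step.items():
        for t in targets:
            inv.setdefault(t, set()).add(w)
    return inv


# Toy system (an assumption made for illustration): legitimate set {"b"}.
R = {"a": {"b"}, "b": {"b"}}
```

Here the only cycle of R is b → b, which lies inside the legitimate set, so (2) holds. Via the inverse relation no derivation at all starts from a, in agreement with (2''), whereas via R itself the derivation a → b → b → ... is infinite.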

4 A First-Order Characterization of Cycles 

Assuming now that R satisfies (0) and (1), we will focus on the problem of
proving the no-cycle property, stated under form (2''). Our method relies on
a first-order characterization of cycles: we will show that an infinite ground
derivation via $R^{-1}$ (as mentioned in (2'')) is actually an instance of an infinite
derivation at the “first-order” level. In order to state our main result, we need
the notion of top chains, which is transposed from [7] in our particular context.



Minimal Reductions and Top Chains. We now deal with one-variable words
t of the form $\#uWv\#$, with $u, v \in \Sigma^*$ and $W \in V$, and we consider the minimal
complete set of substitutions $\mu_1, \ldots, \mu_k$ which make t reducible via a given rule
of the form $\ell \to r$, i.e. the set of most general unifiers of t with a rule lefthand
side $\ell$. The general unification problem for words is complex, and was solved
by Makanin [27]. However our particular unification problem here is simple,
because t and $\ell$ do not share variables, and are “linear” (i.e. contain at most one
occurrence of the same variable). In such a case the number of most general
unifiers is finite: roughly speaking, it suffices to consider all the manners in which
t and $\ell$ overlap depending on the possible instantiations of their variables (see,
e.g., [20]). Suppose now that t and $\ell$ are unifiable via a set of most general
unifiers $\mu_1, \ldots, \mu_k$ ($t\mu_i = \ell\mu_i$ for $i = 1, \ldots, k$), so that each instance $t\mu_i$ of t reduces
to $r\mu_i$. We will disregard unifiers $\mu_j$ which instantiate t at a “variable position”.
We then say that t is minimally reducible at a nonvariable position via the set
of unifiers $\{\mu_1, \ldots, \mu_k\}$. (This corresponds to the operation called “narrowing” in
first-order rewrite systems [29,12,19].) The corresponding set of minimal
reduction steps is $\{t\mu_i \to r\mu_i\}_{i=1,\ldots,k}$.

Example.

Suppose we have a rule of the form $\#X11Y\# \to \#X22Y\#$. The word t:
$\#1W1\#$ unifies with lefthand side $\#X11Y\#$ via most general unifiers:

$\mu_1$: $\{W/1W'\} \cup \{X/\varepsilon, Y/W'1\}$, $\mu_2$: $\{W/W'1\} \cup \{X/1W', Y/\varepsilon\}$,

$\mu_3$: $\{W/\varepsilon\} \cup \{X/\varepsilon, Y/\varepsilon\}$ and $\mu_4$: $\{W/W_1 11 W_2\} \cup \{X/1W_1, Y/W_2 1\}$.

The last unifier (with the associated reduction) is discarded because it
corresponds to a unification taking place at a variable position of t. The minimal
reductions of t corresponding to $\mu_1, \mu_2, \mu_3$ are: $\#11W'1\# \to \#22W'1\#$,
$\#1W'11\# \to \#1W'22\#$ and $\#11\# \to \#22\#$.
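For middle rules, the overlaps that give the minimal reductions of a one-variable word can be enumerated directly. The Python sketch below is an illustration under our own conventions, not the procedure of [7]: the function name `narrow_middle` is hypothetical, the word is passed as its two ground parts u and v, the fresh variable W' is simply written W again in the results, and side conditions on X or Y are ignored.

```python
def narrow_middle(u: str, v: str, l: str, r: str) -> set:
    """Minimal reductions of #uWv# via the middle rule #X l Y# -> #X r Y#.

    Returns pairs (instantiated word, reduced word). Overlaps lying entirely
    inside the variable W (variable positions) are not generated.
    """
    n = len(l)
    results = set()
    # (a) l occurs inside the ground part u or v: W is left untouched.
    for i in range(len(u) - n + 1):
        if u[i:i + n] == l:
            results.add((f"#{u}W{v}#", f"#{u[:i]}{r}{u[i + n:]}W{v}#"))
    for i in range(len(v) - n + 1):
        if v[i:i + n] == l:
            results.add((f"#{u}W{v}#", f"#{u}W{v[:i]}{r}{v[i + n:]}#"))
    # (b) l starts in u and ends in W: W := l[k:] W'.
    for k in range(1, n):
        if u.endswith(l[:k]):
            results.add((f"#{u}{l[k:]}W{v}#", f"#{u[:len(u) - k]}{r}W{v}#"))
    # (c) l starts in W and ends in v: W := W' l[:k].
    for k in range(1, n):
        if v.startswith(l[k:]):
            results.add((f"#{u}W{l[:k]}{v}#", f"#{u}W{r}{v[n - k:]}#"))
    # (d) l bridges u and v: W := l[a:b], a ground (possibly empty) string.
    for a in range(1, n):
        for b in range(a, n):
            if u.endswith(l[:a]) and v.startswith(l[b:]):
                results.add((f"#{u}{l[a:b]}{v}#", f"#{u[:len(u) - a]}{r}{v[n - b:]}#"))
    return results
```

Calling `narrow_middle("1", "1", "11", "22")` yields the three minimal reductions of the example, corresponding to cases (b), (c) and (d) respectively.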

Definition 3. The top minimal reduction chains (or simply top chains) of rewrite
system S form a set of derivations inductively defined as follows:

- Every top rule $\#X\ell\# \to \#Xr\#$ of S is a top minimal reduction chain.

- If C: $u \to \cdots \to w$ is a top minimal reduction chain and $\ell \to r$ is a
rule of S such that w unifies with $\ell$ at a nonvariable position via the set of most
general unifiers $\{\mu_1, \ldots, \mu_k\}$, then $u\mu_i \to \cdots \to w\mu_i \to r\mu_i$ (for $i = 1, \ldots, k$) is a
top minimal reduction chain, called successor of C via $\ell \to r$.

The transitive closure of the successor relation is called “iterated successor”. It
is easy to see that, due to the form of the rules of the string rewrite system S,
a top chain $t_1 \to t_2 \to \cdots \to t_n$ is such that either all the words $t_1, \ldots, t_n$ are
ground, or each $t_i$ ($1 \leq i \leq n$) is of the form $\#u_i W v_i\#$ where the $u_i$'s (resp. $v_i$'s)
are strings of $\Sigma^*$ of the same length.

Example. Consider the top rule $T_4^{-1}$: $\#W21\# \to \#W12\#$ viewed as a
top chain. Its righthand side unifies with the lefthand side of $B_1^{-1}$:
$\#1X2\# \to \#2X1\#$ via the set of most general unifiers $\mu_1$: $\{W/\varepsilon\} \cup \{X/\varepsilon\}$ and
$\mu_2$: $\{W/1W'\} \cup \{X/W'1\}$. Therefore the successors of $T_4^{-1}$ via $B_1^{-1}$ are:

$\#21\# \to \#12\# \to \#21\#$ and $\#1W'21\# \to \#1W'12\# \to \#2W'11\#$.

Note that the requirement that the starting chain is a top rule instead of
a general rule of S is new with respect to Dershowitz’s definition [7]. Such a
requirement is made possible because we will assume additionally that system
$S - Top_S$ is terminating (see Theorem 5).

Definition 4. A top chain (resp. ground derivation) $t_1 \to t_2 \to \cdots \to t_n$ is
quasi-cyclic if $t_i = t_n$ for some $i < n$, and $t_p \neq t_q$ for all distinct p, q less than n.
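Definition 4 translates directly into a small test on a chain given as a list of words. This is a minimal sketch; the function name `is_quasi_cyclic` is ours.

```python
def is_quasi_cyclic(chain: list) -> bool:
    """Definition 4: the last word repeats an earlier one, and all earlier words are distinct."""
    prefix = chain[:-1]
    all_distinct = len(set(prefix)) == len(prefix)
    return all_distinct and chain[-1] in prefix
```

With the two successors computed in the example above, the ground chain `["#21#", "#12#", "#21#"]` is quasi-cyclic, whereas `["#1W21#", "#1W12#", "#2W11#"]` is not.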



A Characterization of Self-Stabilization. We can now state our main result: 



Theorem 5. Let $R = Middle_R \cup Top_R \cup Bottom_R$ be a rewrite system and let
$\mathcal{L}$ be a set of configurations such that:

(0) each ground word is reducible via R,

(1) $\mathcal{L}$ is closed via R, and

(3) $R - Top_R$ (or, equivalently, $(R - Top_R)^{-1}$) is terminating.

Then R is self-stabilizing w.r.t. $\mathcal{L}$ iff there is no quasi-cyclic top chain
$t_1 \to \cdots \to t_n$ via $R^{-1}$ such that $t'_1 \notin \mathcal{L}$ for some ground instance $t'_1$ of $t_1$.

The proof of theorem 5 involves additional notions of “active” or “inactive”
steps within infinite ground derivations, as introduced by Dershowitz [7]. The
corresponding definitions and properties, as well as the proof, are given in the
Appendix.

In order to mechanically check point (3) of theorem 5, i.e. termination of
$R - Top_R$, one can use classical well-founded orderings used in rewriting theory [9].
One can also use Dershowitz’s chain test: generate all the general chains until
either one “cycles” in the sense of [7] (non-termination detection) or all terminate
(termination proof). Dershowitz’s procedure can also be refined if one knows that
$R - Top_R - Bottom_R$ is itself terminating. Then, in order to prove termination
of $R - Top_R$, it suffices to generate only the “bottom” chains via $R - Top_R$,
i.e. chains that start from a bottom rule, and check that none of them has a
cycle. (Of course, in this case, one has to check additionally the termination of
$R - Top_R - Bottom_R$, but this is generally easier.)

5 A Practical Proof Method 

We now present our basic procedure, as well as a fully treated example of
application.



5.1 Basic Procedure and $\phi$-Refinement

Theorem 5 suggests proving self-stabilization by generating all top chains via
$R^{-1}$ and checking the condition over the initial word $t_1$ for quasi-cyclic chains.
Unfortunately, top chains, when generated in a brute manner, are frequently
infinite in number. However it is often possible to discover some recurrent forms,
called “patterns”, for words $t_n$ appearing at the end of chains. For example,
starting from word $\#2W11\#$, and applying Beauquier-Debas system $S = R'^{-1}$
(see below) with rules $\#X20Y\# \to \#X02Y\#$ and $\#X01Y\# \to \#X10Y\#$ (with
$Y \neq \varepsilon$), one generates chains that all end with words of the form $\#0^j 2 W 1 0^k 1\#$
with $j, k \geq 0$. Formally:

Definition 6. A set of patterns $\Pi$ is a set of the form

$\{\#a_1^{i_1} \cdots a_m^{i_m} W b_1^{j_1} \cdots b_n^{j_n}\# \mid \rho(i_1, \ldots, i_m, j_1, \ldots, j_n)\}$

where $a_1, \ldots, a_m, b_1, \ldots, b_n$ are letters of $\Sigma$, $i_1, \ldots, i_m, j_1, \ldots, j_n$ are natural numbers,
and $\rho$ an arithmetical relation.

The discovery of such patterns allows us to characterize finitely an infinite number
of top chains. These patterns will be required to be closed via S, i.e.: for all
$\pi \in \Pi$ and all minimal reductions of $\pi$ to $\pi'$ via S using unifier $\mu$ ($\pi\mu \to \pi'$), $\pi'$ is
in $\Pi$. Additionally, $\pi'$ will be required to be greater than $\pi\mu$ for some ordering >
over words, compatible with substitution, i.e. such that: $t_1 > t_2 \Rightarrow t_1\theta > t_2\theta$, for
any ground substitution $\theta$. Such requirements prevent cycles within derivations
of the form $t_1 \to \cdots \to t_n \to t_{n+1} \to \cdots \to t_{n+m}$, as far as $t_n \in \Pi$ and no word
of $\Pi$ unifies with $t_i$ ($1 \leq i < n$). More precisely:

Definition and Property 7. An S-closed set of growing patterns $\Pi$ is a set
of patterns such that, for all $\pi \in \Pi$ and all minimal reductions of $\pi$ to $\pi'$ via S
using $\mu$ ($\pi\mu \to \pi'$): $\pi' \in \Pi$ and $\pi' > \pi\mu$ for some ordering > compatible with
substitution.

If a non quasi-cyclic chain C: $t_1 \to \cdots \to t_n$ is such that $t_n$ belongs to an
S-closed set $\Pi$ of growing patterns nonunifiable with any $t_1, \ldots, t_{n-1}$, then any
iterated successor of C of the form:

$t_1(\mu_1 \cdots \mu_m) \to \cdots \to t_n(\mu_1 \cdots \mu_m) \to t_{n+1}(\mu_1 \cdots \mu_m) \to \cdots \to t_{n+m}(\mu_1 \cdots \mu_m)$

is itself non quasi-cyclic.

Example. Consider the set $\Pi$: $\{\#0^j 2 W 1 0^k 1\# \mid j \geq 0, k \geq 0\}$. The only
rules of Beauquier-Debas system $S = R'^{-1}$ (see section 5.2) that are applicable
to $\Pi$ are:

$\#X20Y\# \to \#X02Y\#$ and $\#X01Y\# \to \#X10Y\#$, with $Y \neq \varepsilon$.

Given $\pi$: $\#0^j 2 W 1 0^k 1\#$, the first rule (resp. second rule) minimally reduces it
to $\pi'_1$: $\#0^{j+1} 2 W' 1 0^k 1\#$ (resp. $\pi'_2$: $\#0^j 2 W' 1 0^{k+1} 1\#$), which belongs to $\Pi$ and is
greater than $\pi$ with respect to the number of ‘0’s to the left of ‘2’ (resp. number
of ‘0’s to the right of leftmost occurrence of ‘1’). Therefore $\Pi$ is an S-closed set
of growing patterns.

Using theorem 5 and property 7, and assuming (0), (1), (3) for R, we obtain:

Basic Procedure

Start with set of top rules $Top_S$ (with $S = R^{-1}$) as an initial set of chains.

For each chain $t_1 \to \cdots \to t_n$, compute iteratively its successors via rules of S
unless:

- $t_1 \in \mathcal{L}$, or

- $t_n \in \Pi$, for some S-closed set $\Pi$ of growing patterns nonunifiable with any
$t_1, \ldots, t_{n-1}$.

If, among all the generated chains, there is one quasi-cyclic chain
$t_1 \to \cdots \to t_i \to \cdots \to t_n = t_i$ such that $t_1$ has a ground instance $\notin \mathcal{L}$, then R is not
self-stabilizing. Otherwise R is self-stabilizing.

Remark 7. In order to detect if, given $t_1$, there exists a ground instance not
belonging to $\mathcal{L}$, one may assume that $\mathcal{L}$ is characterized by a regular language
(see [17]). If $t_1$ is ground, then the problem reduces to $t_1 \notin \mathcal{L}$, which is decidable.
If $t_1$ is not ground, then $t_1$ is of the form $\#uWv\#$ and the problem reduces to
testing $\#u\Sigma^*v\# \cap \overline{\mathcal{L}} \neq \emptyset$, which is also decidable. Note besides that the assumption
of regularity for $\mathcal{L}$ allows us to mechanically check property (1) of closure via
R. This is equivalent to checking that the image of $\mathcal{L}$ via $\to_R$ is included in $\mathcal{L}$.
Since $\mathcal{L}$ is regular and $\to_R$ can be seen as a rational transduction, this inclusion
problem is decidable (see [23], [17]).
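The test of Remark 7 can be sketched with a small deterministic automaton. The Python code below is an illustration only: it hard-codes an automaton that we wrote by hand for the regular set 0*20*1 ∪ 0*10*2 (the legitimate set of section 5.2, taken without its ‘#’ delimiters), and the state and function names are our own.

```python
SIGMA = "012"
ACCEPT = {"acc"}


def delta(state: str, letter: str) -> str:
    """Transition function of a hand-written DFA for 0*20*1 | 0*10*2."""
    if state == "start":   # still reading the leading 0's
        return {"0": "start", "1": "seen1", "2": "seen2"}[letter]
    if state == "seen2":   # read 0*20*, waiting for the final 1
        return {"0": "seen2", "1": "acc", "2": "dead"}[letter]
    if state == "seen1":   # read 0*10*, waiting for the final 2
        return {"0": "seen1", "1": "dead", "2": "acc"}[letter]
    return "dead"          # from "acc" or "dead", any further letter is fatal


def run(state: str, text: str) -> str:
    for letter in text:
        state = delta(state, letter)
    return state


def in_L(ground: str) -> bool:
    """Membership test for a ground word (given without '#')."""
    return run("start", ground) in ACCEPT


def has_instance_outside_L(u: str, v: str) -> bool:
    """Does the open word #uWv# have a ground instance outside the language?"""
    first = run("start", u)
    reachable, frontier = {first}, [first]
    while frontier:        # states reachable after u followed by any string for W
        q = frontier.pop()
        for a in SIGMA:
            nq = delta(q, a)
            if nq not in reachable:
                reachable.add(nq)
                frontier.append(nq)
    return any(run(q, v) not in ACCEPT for q in reachable)
```

For this particular language the answer is positive for every open word, since W can always be instantiated with extra nonzero letters; the same reachability computation applies unchanged to any other deterministic automaton.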

Our basic procedure can be refined by using a measure, say $\phi$, over ground
words which “never increases” when applying a rule, such as Br in section 2.
Finding such a measure is generally much easier than finding a norm, which
must “always decrease” when applying a rule (or a bounded number of rules) of
R. Formally, we assume given a measure $\phi$ from words over $(\Sigma \cup V)^*$ to N, with
the following properties:

- $\phi$ is non-increasing with R, i.e., is such that $t \to_R t'$ implies $\phi(t) \geq \phi(t')$.

- $\phi$ is compatible with substitution: $\phi(t) \geq \phi(t')$ implies $\phi(t\theta) \geq \phi(t'\theta)$ for any
ground substitution $\theta$.

Given $\phi$, it is easy to show that the basic procedure can be refined by generating
only $\phi$-preserving successors via $S = R^{-1}$ instead of all the possible successors.
(A successor $t_1\mu \to \cdots \to t_n\mu \to t_{n+1}\mu$ of $t_1 \to \cdots \to t_n$ is $\phi$-preserving iff
$\phi(t_n\mu) = \phi(t_{n+1}\mu)$.)

5.2 Beauquier-Debas Algorithm

This system originates from [4], and is an adaptation of Dijkstra’s third (3-state)
algorithm [10].

In our formalism, it corresponds to the following system R:

Bottom  $B_1$: $\#2X1\# \to \#1X2\#$

Top     $T_1$: $\#X00\# \to \#X21\#$
        $T_2$: $\#X10\# \to \#X01\#$
        $T_3$: $\#X20\# \to \#X11\#$
        $T_4$: $\#X12\# \to \#X21\#$
        $T_5$: $\#X22\# \to \#X01\#$

Middle  $M_1$: $\#X10Y\# \to \#X01Y\#$  (with $Y \neq \varepsilon$)
        $M_2$: $\#X11Y\# \to \#X02Y\#$  (with $Y \neq \varepsilon$)
        $M_3$: $\#X12Y\# \to \#X00Y\#$  (with $Y \neq \varepsilon$)
        $M_4$: $\#X02Y\# \to \#X20Y\#$  (with $Y \neq \varepsilon$)
        $M_5$: $\#X22Y\# \to \#X10Y\#$  (with $Y \neq \varepsilon$)

A rule such as $M_1$ with condition $Y \neq \varepsilon$ should be regarded as an abbreviation for 3
rules $M_{1a}$: $\#X10aY\# \to \#X01aY\#$, with $a \in \{0, 1, 2\}$.

$\mathcal{L}$ is defined as: $\#0^*20^*1\# \cup \#0^*10^*2\#$.

In this example, it is assumed that the sum of the elements of the initial
configuration is null, modulo 3. This property is preserved when applying the rules of R.
One checks easily that any ground word (with a null sum of elements) is reducible
via R, and $\mathcal{L}$ is closed via R (see [4]). Therefore, R is self-stabilizing iff there is
no ground cyclic derivation via R containing an element $w \notin \mathcal{L}$. As remarked in
[4], one can see that $T_1, T_2, T_3$ are applied at most once. As a consequence R is
self-stabilizing iff there is no ground cyclic derivation via $R_0 = R - \{T_1, T_2, T_3\}$
containing an element $w \notin \mathcal{L}$. The measure $\phi$ over a word $t \in (\Sigma \cup V)^*$ is
defined as the number of nonnull elements contained by t. Obviously, $\phi$ is
non-increasing with $R_0$ and compatible with substitution. Besides, among rules of $R_0$
only rules $B_1, T_4, M_1, M_4$ preserve the number of nonnull elements. The
$\phi$-refinement of the basic procedure thus consists in generating top chains via
$R'^{-1} = \{B_1^{-1}, T_4^{-1}, M_1^{-1}, M_4^{-1}\}$ instead of $R_0^{-1}$. The chains generated this way
are all derived from the single initial top chain (corresponding to $T_4^{-1}$):

$(c_0)$ $\#X21\# \to \#X12\#$.

Immediate successors of $(c_0)$ are obtained by minimally reducing word $\#X12\#$ either
via $M_1^{-1}$ or via $B_1^{-1}$. In the first case, we generate:

$(c_1)$ $\#X021\# \to \#X012\# \to \#X102\#$.

In the second case, we generate:

$(c_{11})$ $\#21\# \to \#12\# \to \#21\#$.

$(c_{12})$ $\#1X21\# \to \#1X12\# \to \#2X11\#$.

More generally, the first successors of $(c_0)$ are obtained by applying i times ($i \geq 0$) rule
$M_1^{-1}$, then possibly rule $B_1^{-1}$. The application of rule $M_1^{-1}$ i times to $(c_0)$ yields:

$(c_i)$ $\#X0^i21\# \to \#X0^i12\# \to \cdots \to \#X10^i2\#$.

The application of $B_1^{-1}$ then yields:

$(c_{11}^i)$ $\#0^i21\# \to \#0^i12\# \to \#0^{i-1}102\# \to \cdots \to \#10^i2\# \to \#20^i1\#$.

$(c_{12}^i)$ $\#1X0^i21\# \to \#1X0^i12\# \to \#1X0^{i-1}102\# \to \cdots \to \#1X10^i2\# \to \#2X10^i1\#$.

The first element $\#0^i21\#$ of $(c_{11}^i)$ belongs to $\mathcal{L}$, so there is no need to
compute the successors of $(c_{11}^i)$. The last element $\#2X10^i1\#$ of $(c_{12}^i)$ belongs to
$\Pi$: $\{\#0^j 2 X 1 0^k 1\# \mid j \geq 0, k \geq 0\}$, which is an $R'^{-1}$-closed set of growing
patterns (see example above). Furthermore, the other elements of $(c_{12}^i)$: $\#1X0^i21\#$,
$\#1X0^i12\#$, $\ldots$, $\#1X10^i2\#$ cannot unify with any element of $\Pi$ (as they start
with 1 instead of 0 or 2). Therefore we do not need to compute the successors of
$(c_{12}^i)$ either. So the procedure of top chain generation ends. The only quasi-cyclic
chain generated is $(c_{11}^0) = (c_{11}) = \#21\# \to \#12\# \to \#21\#$, which starts with
an element of $\mathcal{L}$. Self-stabilization is thus proved for Beauquier-Debas’s variant
of Dijkstra’s 3-state algorithm. Self-stabilization of Ghosh’s algorithm [13] can
be proved formally along the same lines, as sketched out in section 2.
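As a sanity check that is independent of the chain-generation procedure, the rule table above can be encoded and explored exhaustively for a small number of machines. The Python sketch below is only a brute-force illustration under our own encoding of the rules (words are written without ‘#’, and all names are hypothetical); it is not the proof method of this paper.

```python
import re
from itertools import product

# (name, kind, lhs, rhs); an "ends" rule rewrites the first and the last letter.
RULES = [
    ("B1", "ends", "21", "12"),
    ("T1", "top", "00", "21"), ("T2", "top", "10", "01"), ("T3", "top", "20", "11"),
    ("T4", "top", "12", "21"), ("T5", "top", "22", "01"),
    ("M1", "middle", "10", "01"), ("M2", "middle", "11", "02"),
    ("M3", "middle", "12", "00"), ("M4", "middle", "02", "20"),
    ("M5", "middle", "22", "10"),
]
R0 = [rule for rule in RULES if rule[0] not in ("T1", "T2", "T3")]


def nonnull(s: str) -> int:
    """The measure phi: number of nonnull letters."""
    return sum(1 for c in s if c != "0")


def phi_preserving(rules: list) -> list:
    return [name for name, _, lhs, rhs in rules if nonnull(lhs) == nonnull(rhs)]


def successors(w: str) -> set:
    out = set()
    for _, kind, lhs, rhs in RULES:
        if kind == "ends":
            if len(w) >= 2 and w[0] == lhs[0] and w[-1] == lhs[1]:
                out.add(rhs[0] + w[1:-1] + rhs[1])
        elif kind == "top":
            if w.endswith(lhs):
                out.add(w[:-2] + rhs)
        else:
            # Middle rule with Y nonempty: the pair cannot be the last two letters.
            for i in range(len(w) - 2):
                if w[i:i + 2] == lhs:
                    out.add(w[:i] + rhs + w[i + 2:])
    return out


def legit(w: str) -> bool:
    return re.fullmatch(r"0*20*1|0*10*2", w) is not None


def check(n: int) -> tuple:
    """(no-deadlock, closure, no-cycle) over the words of length n with null sum mod 3."""
    words = ["".join(p) for p in product("012", repeat=n)
             if sum(map(int, p)) % 3 == 0]
    step = {w: successors(w) for w in words}
    no_deadlock = all(step[w] for w in words)
    closed = all(legit(t) for w in words if legit(w) for t in step[w])
    # By closure, a cycle through a non-legitimate word stays among such words:
    # peel off words with no successor left until nothing more can be removed.
    remaining = {w for w in words if not legit(w)}
    changed = True
    while changed:
        changed = False
        for w in list(remaining):
            if not (step[w] & remaining):
                remaining.discard(w)
                changed = True
    return no_deadlock, closed, not remaining
```

`phi_preserving(R0)` returns the four rules named in the text, and `check(3)` confirms conditions (0), (1) and (2) for three machines, a case small enough to verify by hand. Other values of n can be explored in the same way, at a cost exponential in n.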

6 Compositionality 

It is interesting to relate compositionality results obtained in the context of
self-stabilizing systems with those obtained in the context of rewrite systems.
Similarly to what Dershowitz proved w.r.t. composition of terminating systems
(see theorem in [7], p. 456), one can derive from theorem 5 sufficient conditions
for self-stabilization of the combination of two self-stabilizing systems.

Theorem 8. Assume that:

- $R_1$ is self-stabilizing w.r.t. $\mathcal{L}_1$.

- $R_2$ is self-stabilizing w.r.t. $\mathcal{L}_2$ when starting from $\mathcal{L}_1$. (Here, “when starting
from $\mathcal{L}_1$” means that condition (0) is replaced with “(0') Each ground word of $\mathcal{L}_1$ is
reducible via $R_2$” in definition 1 of self-stabilization for $R_2$ w.r.t. $\mathcal{L}_2$.)

- there is no overlap between lefthand sides of $R_1$ and righthand sides of $R_2$.

- all executions are fair w.r.t. both $R_1$ and $R_2$. (See, e.g., [30], p. 476, for a formal
definition of “fairness”.)

Then $R_1 \cup R_2$ is self-stabilizing w.r.t. $\mathcal{L}_2$.

This can be seen as a version of Herman’s compositionality result [16] in our
context.

7 Conclusion and Perspectives

In contrast with methods relying on the existence of a strictly decreasing norm
function [4,13,18,22], our method requires only little specific knowledge and
proposes a uniform framework for the full proof of several non trivial examples,
as shown here on Ghosh’s 4-state algorithm [13] and Beauquier-Debas’s 3-state
algorithm [4]. These examples are simple ones, which allow us to give a clear
view of the procedure. Besides, the corresponding problem is interesting in itself
because it is related to mutual exclusion and has a non trivial specification. Our
procedure is inspired by Dershowitz’s chain generation procedure [7], and proves
convergence of self-stabilizing algorithms much in the same way as Dershowitz
proves the termination of rewrite systems. The method is not fully automatic:
we need in particular to infer by hand generic “patterns” of configurations from
words produced recurrently throughout derivations. Note that patterns have
also been used in previous works on self-stabilization, e.g. in [17] where they are
expressed under the form of regular languages. The main differences with [17]
(and other traditional proof methods of self-stabilization) come from:

1. considering the reverse transition relation $R^{-1}$ instead of $R$;

2. focusing on derivations (via $R^{-1}$) originating from top-configurations
instead of derivations (via $R$) starting from any configuration;

3. deriving new configurations (via $R^{-1}$) through a restricted strategy of rule
application (top chain generation) instead of deriving all the possible successor
configurations (via $R$).

We have also given a natural counterpart of Herman’s compositionality re- 
sult in our framework. We are currently investigating an extension of our proof 
method in two directions. We first want to consider more realistic token ring 
algorithms, like different versions of the IBM token ring or FDDI protocols, in- 
volving a change from the state reading model to the message passing model. In 
this case, channel states must be modeled and the hypotheses must be slightly 
modified to consider self-stabilization. The second natural way is an extension to 
algorithms running on arbitrary (non-ring) networks. In this framework, strings 
must be replaced by graphs. We would then have to use graph rewriting tech- 
niques, as proposed in [24,25,26] for other properties of distributed systems. As 
a promising first step, we consider the case of tree networks, in which leaves can 
receive markers, in the same way that top and bottom nodes receive '#'.

Appendix: Proof of Theorem 5

The proof of theorem 5 relies on properties involving the notions of active or 
inactive steps for derivations, as defined by Dershowitz [7]. 

Definition 9. The active area of a ground word $w_i$ in a ground derivation
$w_1 \to w_2 \to \cdots \to w_n$ is that part of $w_i$ that has been created by the
nonvariable portions of the righthand sides of the rules that have been applied.
Only the top letter (i.e., the rightmost letter) of the initial word $w_1$ is
considered active.

More precisely, suppose that a rule of the form $\#X\ell_1Y\# \to \#Xr_1Y\#$
(resp. $\#\ell_1X\ell_2\# \to \#r_1Xr_2\#$) is applied to a ground word $w$ of the
form $\#u_1\ell_1u_2\#$ (resp. $\#\ell_1u_1\ell_2\#$) to obtain a ground word $w'$ of
the form $\#u_1r_1u_2\#$ (resp. $\#r_1u_1r_2\#$). Then, in $w'$, all the letters of
$r_1$ (resp. $r_1$ and $r_2$) are active if at least one letter of $\ell_1$ (resp.
$\ell_1$ or $\ell_2$) was active; besides, all the letters of $u_1, u_2$ that were
already active in $w$ remain active in $w'$. We say that a ground word is active if
its top letter is (active letters will be overlined).



Definition 10. An active ground derivation via $S$ (resp. inactive ground
derivation via $S$) is a ground derivation $w_1 \to w_2 \to \cdots \to w_n$ in which
rules of $S$ are applied only in the active area (resp. inactive area) of words.

We will denote by $\to_a$ (resp. $\to_i$) an application of a rule at an active area
(resp. inactive area) of a ground word.

We now give a property A.1, which is the counterpart of the relation between
reduction sequences and narrowing sequences in first-order terms theory [19], and
a property A.2, which was proved by Dershowitz [7].

Property A.1 (lifting lemma).

For every active ground derivation $A : w_1 \to_a w_2 \to_a \cdots \to_a w_n$ via $S$,
there exists a top chain $\mathcal{C} : t_1 \to t_2 \to \cdots \to t_n$ via $S$ which
has $A$ as an instance (i.e. such that $w_1 = t_1\theta, w_2 = t_2\theta, \ldots,
w_n = t_n\theta$, for some ground substitution $\theta$).

Property A.2 (semi-commutation).

Let $w_1, w_2, w_3$ be ground words such that $w_1 \to_a w_2 \to_i w_3$. Then there
exists a ground word $w_2'$ such that $w_1 \to_i w_2' \to_a w_3$.

We finally give two properties A.3 and A.4, which are adapted from [7] in our
context.

Property A.3. Let $S$ be a rewrite system such that $S - \mathit{Top}_S$ is
terminating. Then any infinite ground derivation via $S$ has only a finite number
of inactive steps.

Proof. Consider an infinite ground derivation $A : w_1 \to \cdots \to w_n \to \cdots$
via $S$. Applying a rule at an active area of a ground word cannot create any new
inactive letters, while applying a rule at an inactive area only replaces a certain
portion of inactive area by another inactive portion of the same length. Therefore
either (a) all the inactive subareas of $A$ disappear after a finite number of
steps, or (b) at least one of them remains, but between two fixed positions. In
case (b), there is a subpart $A'$ of $A$ of the form $w_i \to \cdots \to w_n \to \cdots$,
such that every $w_n$ is of the form $x_n v_n y_n$, all the $x_n$'s (resp. $v_n$'s,
$y_n$'s) have the same length, and the $v_n$'s always remain inactive. Since the
active application of rules over subparts of $x_n$ or $y_n$ has no influence over
the inactive portion $v_n$, one can extract from $A'$ an infinite inactive ground
derivation $A''$ affecting only the $v_n$'s. This infinite derivation $A''$ never
makes use of a top rule since the top letter is active. This is in contradiction
with the assumption that $S - \mathit{Top}_S$ is terminating. The only possible
case is therefore (a): all the inactive subareas disappear after a finite number of
steps.

Property A.4. Suppose that $S - \mathit{Top}_S$ is terminating, and $\mathcal{L}$ is
closed via $S^{-1}$. Then there is an infinite ground derivation via $S$ starting
from a word $w \notin \mathcal{L}$ if and only if there is a quasi-cyclic top chain
via $S$, starting from a word $t$ such that $t\theta \notin \mathcal{L}$ for some
ground substitution $\theta$.

Proof. The if part is obvious. To prove the only-if part, consider an infinite
ground derivation $A$ starting from a word $w \notin \mathcal{L}$. By property A.3
(since $S - \mathit{Top}_S$ is terminating) this infinite derivation contains only a
finite number of inactive steps. Applying iteratively property A.2, one can push
back these inactive steps to the beginning of the derivation, thus obtaining a
reordered infinite ground derivation $A'$. From some point on, there are only
active steps in derivation $A'$. Let $A'' : w_i \to w_{i+1} \to \cdots$ denote this
active infinite ground part of the derivation. Since $\mathcal{L}$ is closed via
$S^{-1}$, we must have $w_i \notin \mathcal{L}$. Besides, since the rules are
length-preserving, there is an initial part of $A''$ of the form
$w_i \to \cdots \to w_j \to \cdots \to w_n$ such that $w_j = w_n$ and $w_p \neq w_q$
for all $p < q < n$. By the lifting lemma, there is a top chain
$\mathcal{C} : t_i \to \cdots \to t_j \to \cdots \to t_n$ whose instance
$\mathcal{C}\theta$ is this initial part of $A''$, for some ground substitution
$\theta$. In particular $t_j\theta = w_j = w_n = t_n\theta$. But $t_j$ and $t_n$ are
either both ground or of the form $\#u_jWv_j\#$ and $\#u_nWv_n\#$ with
$|u_j| = |u_n|$ and $|v_j| = |v_n|$. So $t_j = t_n$ follows from
$t_j\theta = t_n\theta$. Besides, for all $p < q < n$, $t_p$ and $t_q$ are distinct
since their instances $w_p, w_q$ via $\theta$ are distinct. Therefore
$t_i \to \cdots \to t_j \to \cdots \to t_n$ is a quasi-cyclic top chain, starting
from $t_i$ with $t_i\theta = w_i \notin \mathcal{L}$.

Now theorem 5 directly results from property A.4 above and the definition of
self-stabilization.

Abstract. Program refinements from an abstract to a concrete model
empower designers to reason effectively in the abstract and architects to 
implement effectively in the concrete. For refinements to be useful, they 
must not only preserve functionality properties but also dependability 
properties. In this paper, we focus our attention on refinements that 
preserve the property of stabilization. 

We distinguish between two types of stabilization-preserving refinements 
— atomicity refinement and semantics refinement — and study the for- 
mer. Specifically, we present a stabilization-preserving atomicity refine- 
ment from a model where a process can atomically access the state of 
all its neighbors and update its own state, to a model where a process 
can only atomically access the state of any one of its neighbors or atom- 
ically update its own state. (Of course, correctness properties, including 
termination and fairness, are also preserved.) 

Our refinement is based on a low-atomicity, bounded-space, stabilizing
solution to the dining philosophers problem. It is readily extended
to: (a) solve stabilization-preserving semantics refinement, (b) solve the 
drinking philosophers problem, and (c) allow further refinement into a 
message-passing model. 



1 Introduction 

Concurrent programming involves reasoning about the interleaving of the execu- 
tion of multiple processes running simultaneously. On one hand, if the grain of 
atomic (indivisible) actions of a concurrent program is assumed to be coarse, the 
number of possible interleavings is kept small and the program design is made 
simple. On the other hand, if the program is to be efficiently implemented, its 
atomic actions must be fine-grain. This motivates the need for refinements from 
high-atomicity programs to low-atomicity programs.

Atomicity refinement must preserve the correctness of the high-atomicity
program. In other words, the safety (e.g., invariants) and the liveness (e.g.,
termination and fairness) properties of that program must also hold in the
corresponding low-atomicity program. But it is also important to preserve the
non-functional properties of the high-atomicity program. In this paper, we
concentrate on refinements that, in addition to preserving functionality, preserve
the property of stabilization.

Informally speaking, stabilization of a program with respect to a set of legit- 
imate states implies that upon starting from an arbitrary state, every execution 
of the program eventually reaches a legitimate state and thereafter remains in 
legitimate states. It follows that a stabilizing program does not necessarily have 
to be initialized and is able to recover from transient failures. 

To be fair, our notion of stabilization-preserving atomicity refinement should
be distinguished from what we call stabilization-preserving semantics
refinement:

— Atomicity refinement. In this case, the atomicity of program actions is refined 
from high to low, but the semantics of concurrency in program execution 
is not. For instance, both the high- and low-atomicity programs may be
executed in interleaving semantics, where only one atomic operation may be 
executed at a time. Alternatively, both programs may be executed in power- 
set semantics, where any number of processes may each execute an atomic 
action at a time, or both in partial-order semantics, etc. 

— Semantics refinement. In this case, the semantics of concurrency in program 
execution is refined, but the program atomicity is not. For instance, a pro- 
gram in interleaving semantics may be refined to execute (with identical 
actions) in power-set semantics [2]. The program is more easily reasoned 
about in the former semantics, but more easily implemented in the latter. 

An elegant solution for a semantics refinement problem has been proposed 
by Gouda and Haddix [12]. Their solution does not however achieve atomicity 
refinement. In this paper, by way of contrast, we focus on an atomicity refinement 
problem. But, as an aside, we demonstrate that our solution is applicable for 
semantics refinement as well. 

Specifically, we consider atomicity refinement from a model where a process can 
atomically access the state of all its neighbors and update its own state, to 
a model where a process can only atomically access the state of any one of its 
neighbors or atomically update its own state. (We also address further refinement 
to a message-passing model.) In all models, concurrent execution of actions of 
processes is in interleaving semantics. 

As can be expected, the straightforward division of high-atomicity actions 
into a sequence of low-atomicity actions does not suffice because each sequence 
may not execute in isolation. A simple strategy for refinement, therefore, is to 
execute each sequence in a mutually exclusive manner. Of course, the
mechanism for achieving mutual exclusion has to be (i) itself stabilizing, in order
for the refinement to be stabilization-preserving, (ii) in low atomicity, since the
refined program is in low atomicity, and (iii) bounded space, to be implemented
reasonably.
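
To see why the straightforward division does not suffice, here is a minimal sketch in Python. The two-process program is hypothetical and is not taken from the paper: each process holds a bit x[i], and its single high-atomicity action is "if x[i] = x[j] then x[i] := 1 - x[i]". The function names and the explicit local image `copy` are assumptions made for illustration.

```python
def high_atomicity_step(x, i):
    """Read the neighbor and update x[i] in one indivisible step."""
    j = 1 - i
    if x[i] == x[j]:
        x[i] = 1 - x[i]

def read_step(x, copy, i):
    """Low-atomicity step 1: copy the neighbor's value into a local image."""
    copy[i] = x[1 - i]

def write_step(x, copy, i):
    """Low-atomicity step 2: update x[i] using only the (possibly stale) image."""
    if x[i] == copy[i]:
        x[i] = 1 - x[i]

# High atomicity: from (0, 0) one process moves and a fixpoint is reached.
x = [0, 0]
high_atomicity_step(x, 0)
high_atomicity_step(x, 1)      # guard is false now, nothing happens
high_result = tuple(x)

# Naive low atomicity: both processes read, then both write on stale images.
x, copy = [0, 0], [None, None]
read_step(x, copy, 0)
read_step(x, copy, 1)
write_step(x, copy, 0)
write_step(x, copy, 1)
low_result = tuple(x)

print(high_result, low_result)
```

In the interleaved low-atomicity run both bits flip, so the processes still agree and can repeat the pattern forever, whereas every high-atomicity execution from (0, 0) stops after one step. Executing each read/write sequence under mutual exclusion between neighbors rules this interleaving out.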

This simple strategy unfortunately suffers from loss of concurrency, since 
no two processes can execute sequences concurrently even if these sequences 
operate on completely disjoint state spaces. We are therefore led to solving the 
problem of dining philosophers, which requires mutual exclusion only between 
“neighboring” processes, and thus allows more concurrency. 

Although there are a number of stabilizing mutual exclusion programs in the
literature [4,5,11,15,18], none of them is easily generalized to solve dining
philosophers. Mizuno and Nesterenko [17] consider dining philosophers in order to
solve a problem that has a flavor of atomicity refinement, but their solution uses
infinite variables. It is well-known that bounding the state of stabilizing
programs is often challenging [3]. This motivates a new solution to the dining
philosophers problem which satisfies the requirements (i)-(iii) above.

Other notable characteristics of our refinement include: It is sound and
complete, i.e., every computation of the low-atomicity program corresponds to a
unique computation of the high-atomicity program, and for every computation of
the high-atomicity program there is a computation of the low-atomicity program
that corresponds to it. It is fixpoint-preserving, i.e., terminating computations
of the high-atomicity program correspond only to terminating computations of
the low-atomicity program. It is fairness-preserving, i.e., weak fairness of action
execution is preserved, which intuitively implies that the refinement includes a
stabilizing, low-atomicity, weakly fair scheduler.

We describe further refinement into a message-passing model. An (unbounded
space) transformation from a high-atomicity model into a message-passing model is
presented in [16]. Our solution has bounded space complexity.

The rest of the paper is organized as follows. We define the model, syntax, and 
semantics of the programs we use in Section 2. We then present a low-atomicity 
bounded-space dining philosophers program and prove its correctness and
stabilization properties, in Section 3. Next, in Section 4, we demonstrate how a
high-atomicity program is refined using our dining philosophers program, and show
the relationship between the refined program and the original high-atomicity
program in terms of soundness, completeness, and fixpoint- and
fairness-preservation. We summarize our contribution and discuss extensions of our
work in Section 5.

2 Model, Syntax, and Semantics 

Model. A program consists of a set of processes and a binary reflexive symmetric
relation $N$ between them. The processes are assumed to have unique identifiers 1
through $n$. Processes $P_i$ and $P_j$ are called neighbor processes iff
$(P_i, P_j) \in N$. Each process in the system consists of a set of variables, a
set of parameters, and a set of guarded commands (GC).

Syntax of high-atomicity programs. The syntax of a process $P_i$ has the
form:



process P_i
  par ⟨declarations⟩
  var ⟨declarations⟩
*[
  ⟨guarded command⟩ [] ... [] ⟨guarded command⟩
]

Declarations is a comma-separated list of items, each of the form:

  ⟨list of names⟩ : ⟨domain⟩

A variable can be updated (written to) only by the process that contains the 
variable. A variable can be read by the process that contains the variable or by 
a neighbor process. We refer to a variable $v$ that belongs to process $P_i$ as $v_i$.

A parameter is used to define a set of variables and a set of guarded commands
as one parameterized variable and guarded command respectively. For
example, let a process $P_i$ have parameter $j$ ranging over values 2, 5, and 9; then
a parameterized variable $x.j$ defines a set of variables
$\{x.j \mid j \in \{2,5,9\}\}$ and a parameterized guarded command $GC.j$ defines
the set of GCs:

  GC.(j := 2) [] GC.(j := 5) [] GC.(j := 9)

A guarded command has the syntax:

  ⟨guard⟩ → ⟨command⟩

A guard is a boolean expression containing local and neighbor variables. A
command is a finite comma-separated sequence of assignment statements updating
local variables and branching statements. An assignment statement can be
simple or quantified. A quantified assignment statement has the form:

  (‖ ⟨range⟩ : ⟨assignments⟩)

Here ⟨range⟩ specifies a bound variable and the values it takes. Assignments is
a comma-separated list of assignment statements containing the bound variable.
Similar to a parameterized GC, a quantified statement represents a set of
assignment statements where each assignment statement is obtained by replacing
every occurrence of the bound variable in the assignments by its instance
from the specified range.
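
As a rough illustration of how a loop of guarded commands `*[ g1 → c1 [] ... [] gn → cn ]` executes, the following Python sketch repeatedly picks one enabled GC and runs its command until no guard holds. The helper name `run_gcs`, the dictionary representation of the state, and the use of Euclid's algorithm as the example program are all assumptions made for illustration; none of them come from the paper.

```python
import random

def run_gcs(state, gcs, rng=None, max_steps=10_000):
    """Execute *[ g1 -> c1 [] ... [] gn -> cn ] until no guard is true."""
    rng = rng or random.Random(0)
    for _ in range(max_steps):
        enabled = [cmd for guard, cmd in gcs if guard(state)]
        if not enabled:
            return state            # no GC is enabled: the loop terminates
        rng.choice(enabled)(state)  # nondeterministic choice among enabled GCs
    raise RuntimeError("no fixpoint reached within max_steps")

def sub_x(s):
    s["x"] -= s["y"]

def sub_y(s):
    s["y"] -= s["x"]

gcd_program = [
    (lambda s: s["x"] > s["y"], sub_x),   # x > y -> x := x - y
    (lambda s: s["y"] > s["x"], sub_y),   # y > x -> y := y - x
]

result = run_gcs({"x": 12, "y": 18}, gcd_program)
print(result)
```

A quantified assignment such as `(‖ k > i : a.k := a.k + 1)` would correspond to a command that loops over the range and applies the assignment for each value of the bound variable.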

Syntax of low-atomicity programs. The syntax for the low-atomicity program
is the same as for the high-atomicity program with the following
restrictions. The variable declaration section of a process has the following syntax:

  var
    private ⟨declarations⟩
    public ⟨declarations⟩



A variable declared as private can be read only by the process that contains
this variable. A public variable can also be read by a neighbor process. A
guarded command can be either synch or update. A synch GC mentions the
public variables of one neighbor process and local private variables only. An
update GC mentions local private and public variables.

Let $v_i$ be a private variable of $P_i$ and $v_j$ a public variable of $P_j$. We say
that $v_i$ is an image of $v_j$ if there is a synch GC of process $P_i$ that is
enabled when $v_i \neq v_j$ and which assigns $v_i := v_j$, and $v_i$ is not updated
otherwise. The variable whose value is copied to the image variable is called a
source of the image.

Semantics. The semantics of high- and low-atomicity programs is the same
(cf. [1]). An assignment of values to variables of all processes in the concurrent
program is a state of this program. A GC whose guard is true at some state of
the program is enabled at this state. A computation is a maximal fair sequence
of steps such that for each state $s_i$ the state $s_{i+1}$ is obtained by executing
the command of some GC that is enabled at $s_i$. The maximality of a computation
means that no computation can be a proper prefix of another computation and
the set of all computations is suffix-closed. That is, a computation either
terminates in a state where none of the GCs are enabled or the computation is
infinite. The fairness of a computation means that no GC can be enabled in
infinitely many consecutive states of the computation. A boolean variable is set in
some state $s$ if the value of this variable is true in $s$, otherwise the variable
is cleared.

A state predicate (or just predicate) is a boolean expression on the state of a 
program. A state conforms to some predicate if this predicate has value true at 
this state. Otherwise, the state violates the predicate. By this definition every 
state conforms to predicate true and none conforms to false. 

Let $\mathcal{P}$ be a program and $R$ and $S$ be state predicates on the states of
$\mathcal{P}$. $R$ is closed if every state of a computation of $\mathcal{P}$ that
starts in a state conforming to $R$ also conforms to $R$. $R$ converges to $S$ in
$\mathcal{P}$ if $R$ is closed in $\mathcal{P}$, $S$ is closed in $\mathcal{P}$, and
any computation starting from a state conforming to $R$ contains a state conforming
to $S$. If true converges to $R$, we say that $R$ just converges. $\mathcal{P}$
stabilizes to $R$ iff true converges to $R$ in $\mathcal{P}$. In the rest of the
paper we omit the name of the program whenever it is clear from the context.
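
For a finite-state program, the definitions of closure, convergence, and stabilization can be checked mechanically. The sketch below is one possible way to do so under simplifying assumptions: predicates are given as sets of states, the program is given as a successor map `succ`, and fairness is ignored (so the check is conservative: it may reject a program that stabilizes only under fair scheduling). All function names are hypothetical.

```python
def is_closed(pred, succ):
    """pred is closed if every transition from a pred-state stays in pred."""
    return all(t in pred for s in pred for t in succ[s])

def converges(r, s, succ):
    """r converges to s: both are closed and every maximal path from r meets s."""
    if not (is_closed(r, succ) and is_closed(s, succ)):
        return False
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def avoids_s(u):
        """True if some maximal path from u never visits a state of s."""
        if u in s:
            return False
        if not succ[u]:
            return True               # the computation terminates outside s
        color[u] = GREY
        for v in succ[u]:
            if v in s:
                continue
            c = color.get(v, WHITE)
            if c == GREY or (c == WHITE and avoids_s(v)):
                return True           # a cycle outside s, or a bad path later on
        color[u] = BLACK
        return False

    return not any(color.get(u, WHITE) == WHITE and avoids_s(u) for u in r)

def stabilizes(states, s, succ):
    """The program stabilizes to s iff true (all states) converges to s."""
    return converges(set(states), s, succ)

good = {0: {1}, 1: {2}, 2: {2}}       # every path reaches the legitimate state 2
bad = {0: {1}, 1: {0}, 2: {2}}        # states 0 and 1 cycle forever outside {2}
print(stabilizes(good, {2}, good), stabilizes(bad, {2}, bad))
```

The depth-first search looks for a counterexample: a computation from a state of `r` that either ends or loops without ever entering `s`.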

3 Dining Philosophers Program 

3.1 Description 

The dining philosophers problem was first stated in [8]. Any process in the system
can request access to a certain portion of code called the critical section (CS).
The objective of the algorithm is to ensure that the following two properties
hold:

safety: no two neighbor processes have guarded commands that execute CS
enabled in one state;

liveness: a process requesting to execute CS is eventually allowed to do so.



process P_i
  par j : (P_i, P_j) ∈ N
  var
    public
      ready_i : boolean,
      a_i.j, c_i.j : (0..3)
    private
      request_i : boolean,
      r_i.j, y_i.j : boolean,
      b_i.j, d_i.j : (0..3)
*[
(dp1) request_i ∧ ¬ready_i ∧ (∀k : a_i.k = d_i.k) ∧ (∀k > i : ¬y_i.k) →
        ready_i := true,
        (‖ k > i : y_i.k := r_i.k, a_i.k := (a_i.k + 1) mod 4)
[]
(dp2) ready_i ∧ (∀k : a_i.k = d_i.k) ∧ (∀k < i : ¬r_i.k) →
        /* critical section */
        ready_i := false,
        (‖ k < i : a_i.k := (a_i.k + 1) mod 4)
[]
(dp3) c_i.j ≠ b_i.j →
        c_i.j := b_i.j
[]
(dp4) r_i.j ≠ ready_j ∨ (b_i.j ≠ a_j.i) ∨ (d_i.j ≠ c_j.i) ∨ (j > i ∧ ¬ready_j ∧ y_i.j) →
        r_i.j := ready_j,
        b_i.j := a_j.i,
        d_i.j := c_j.i,
        if (j > i ∧ ¬ready_j ∧ y_i.j) then y_i.j := false fi
]

Fig. 1. Dining philosophers process



This section describes a program $\mathcal{DP}$ that solves the dining philosophers
problem. Every process $P_i$ of $\mathcal{DP}$ is shown in Figure 1. To refer to a
guarded command executed by some process we attach the process identifier to the
name of the guarded command shown in Figure 1. For example, guarded command
$dp1_i$ sets variable $ready_i$. We sometimes use GC identifiers in state
predicates. For example, $dp1_i$ used in a predicate means that the guard of this
GC is enabled.

Every $P_i$ has the following variables:

- $request_i$ - abstracts the reaction of the environment. It is a read-only
variable which is used in program composition in later sections. $P_i$ wants to
enter its CS if $request_i$ is set.

- $ready_i$ - indicates if $P_i$ tries to execute its CS. $P_i$ is in CS contention
if $ready_i$ is set.

- $r_i.j$ - records whether $P_j$ is in CS contention; it is an image of $ready_j$.

- $y_i.j$ - records if $P_j$ requests CS and needs to be allowed to access it before
$P_i$ can request its own CS again. It is maintained for each $P_j$ such that
$j > i$; it is called the yield variable.

- $a_i.j$, $b_i.j$, $c_i.j$, $d_i.j$ - used for synchronization between neighbor
processes; they are called handshake variables.

The basic idea of the program is: among the neighbor processes in CS
contention the one with the lowest identifier is allowed to proceed. To ensure
fairness, when a process joins CS contention it records the neighbors in CS
contention with ids greater than its own; after the process exits its CS it is not
allowed to request CS again until the recorded neighbors enter their CS.

Let us consider neighbor processes $P_i$ and $P_j$ and the following sequence of
handshake variables $H_{ij} = (a_i.j, b_j.i, c_j.i, d_i.j)$. We say that $a_i.j$ has
a token if $a_i.j$ is equal to $d_i.j$. We say that any of the other variables has a
token if it is not equal to the variable preceding it in $H_{ij}$. $\mathcal{DP}$ is
constructed so that $H_{ij}$ forms a ring similar to the ring used in the K-state
stabilizing protocol described in [9].
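
The token rule for a handshake sequence can be made concrete with a small Python sketch. The function name `token_holders` and the trace below are illustrative assumptions; the rule itself (a holds a token when it equals d, every other variable when it differs from its predecessor) is the one stated above.

```python
def token_holders(a, b, c, d):
    """Names of the variables in H = (a, b, c, d) that hold a token."""
    holders = []
    if a == d:
        holders.append("a")           # a holds a token when it equals d
    for name, value, prev in (("b", b, a), ("c", c, b), ("d", d, c)):
        if value != prev:
            holders.append(name)      # the others: differ from their predecessor
    return holders

# One full circulation starting from the all-zero ring.
a = b = c = d = 0
trace = [token_holders(a, b, c, d)]
a = (a + 1) % 4
trace.append(token_holders(a, b, c, d))   # a passes the token by incrementing
b = a
trace.append(token_holders(a, b, c, d))   # b copies a
c = b
trace.append(token_holders(a, b, c, d))   # c copies b
d = c
trace.append(token_holders(a, b, c, d))   # d copies c, the token is back at a
print(trace)
```

From a consistent state the single token visits a, b, c, d and returns to a. An arbitrary (corrupted) state may hold several tokens, for example `token_holders(0, 1, 2, 3)` reports tokens at b, c, and d.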

Every $P_i$ has the following GCs:

$dp1_i$ - update GC. When $P_i$ wants to enter CS, it is not in CS contention,
for every neighbor $P_j$, $a_i.j$ has a token, and yield variables for processes
with identifiers greater than $i$ are not set; then $P_i$ sets $ready_i$, joining CS
contention; $P_i$ also sets yield variables for processes that are in CS contention
and increments $a_i.j$ for every $P_j$ with identifier greater than $i$, passing the
token in the handshake sequence.

Note that when $a_i.j$ collects the token again $P_j$ is informed of $P_i$'s joining
CS contention, that is, $r_j.i$ is set. This ensures safety of the program.

$dp2_i$ - update GC. When $P_i$ is in CS contention, every $a_i.j$ has the token and
processes with smaller identifiers are not in CS contention, then it is safe
for $P_i$ to execute its CS. Note that the critical section is put in comments in
Figure 1 since no actual CS is executed. $P_i$ clears $ready_i$ and increments
$a_i.j$ for every $P_j$ with identifier less than $i$, again passing the tokens.

Note that when the tokens are collected every $P_j$ is informed that $P_i$ exited
CS and yield variable $y_j.i$ is cleared. This ensures liveness of the program:
$P_i$ cannot repeatedly enter CS while $P_j$ is stuck with $y_j.i$ set.

$dp3_i$ - update GC. It sets $c_i.j$ equal to $b_i.j$, thus passing the token from c
to d.

$dp4_i$ - synch GC. It passes the tokens from $b_i.j$ to $c_i.j$ and from $d_i.j$ to
$a_i.j$. $dp4_i$ also copies the value of $ready_j$ to its image $r_i.j$ and clears
the yield variable $y_i.j$ when $P_j$ is not in CS contention.

3.2 Proof of Correctness of $\mathcal{DP}$
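
The four guarded commands can be exercised with a small simulation. The Python sketch below follows the textual description of dp1 to dp4 given above; it is an illustration under stated assumptions, not the authors' implementation. The assumptions are: processes are numbered from 0, `nbrs[i]` lists the neighbors of process i (excluding i itself), `request` is always set, and a seeded random scheduler stands in for a fair one. All function names are hypothetical.

```python
import random

def make_state(n, nbrs, rng=None):
    """All-zero (consistent) state, or an arbitrary one if rng is given."""
    bit = (lambda: rng.random() < 0.5) if rng else (lambda: False)
    val = (lambda: rng.randrange(4)) if rng else (lambda: 0)
    s = {"ready": [bit() for _ in range(n)], "request": [True] * n}
    for name, gen in (("a", val), ("b", val), ("c", val), ("d", val),
                      ("r", bit), ("y", bit)):
        s[name] = {(i, j): gen() for i in range(n) for j in nbrs[i]}
    return s

def tokens_home(s, i, nbrs):
    return all(s["a"][i, k] == s["d"][i, k] for k in nbrs[i])

def dp2_enabled(s, i, nbrs):
    return (s["ready"][i] and tokens_home(s, i, nbrs)
            and not any(s["r"][i, k] for k in nbrs[i] if k < i))

def enabled(s, n, nbrs):
    acts = []
    for i in range(n):
        if (s["request"][i] and not s["ready"][i] and tokens_home(s, i, nbrs)
                and not any(s["y"][i, k] for k in nbrs[i] if k > i)):
            acts.append(("dp1", i, None))
        if dp2_enabled(s, i, nbrs):
            acts.append(("dp2", i, None))
        for j in nbrs[i]:
            if s["c"][i, j] != s["b"][i, j]:
                acts.append(("dp3", i, j))
            clear_y = j > i and not s["ready"][j] and s["y"][i, j]
            if (s["r"][i, j] != s["ready"][j] or s["b"][i, j] != s["a"][j, i]
                    or s["d"][i, j] != s["c"][j, i] or clear_y):
                acts.append(("dp4", i, j))
    return acts

def execute(s, act, nbrs):
    name, i, j = act
    if name == "dp1":                      # join CS contention
        s["ready"][i] = True
        for k in nbrs[i]:
            if k > i:
                s["y"][i, k] = s["r"][i, k]
                s["a"][i, k] = (s["a"][i, k] + 1) % 4
    elif name == "dp2":                    # the critical section would run here
        s["ready"][i] = False
        for k in nbrs[i]:
            if k < i:
                s["a"][i, k] = (s["a"][i, k] + 1) % 4
    elif name == "dp3":
        s["c"][i, j] = s["b"][i, j]
    else:                                  # dp4: synchronize with neighbor j
        if j > i and not s["ready"][j] and s["y"][i, j]:
            s["y"][i, j] = False
        s["r"][i, j] = s["ready"][j]
        s["b"][i, j] = s["a"][j, i]
        s["d"][i, j] = s["c"][j, i]

def simulate(nbrs, steps, seed=0, arbitrary_start=False):
    """Return (CS entries per process, number of states violating safety)."""
    rng = random.Random(seed)
    n = len(nbrs)
    s = make_state(n, nbrs, rng if arbitrary_start else None)
    cs_count, violations = [0] * n, 0
    for _ in range(steps):
        if any(dp2_enabled(s, i, nbrs) and dp2_enabled(s, k, nbrs)
               for i in range(n) for k in nbrs[i]):
            violations += 1
        acts = enabled(s, n, nbrs)
        if not acts:
            break
        act = rng.choice(acts)
        if act[0] == "dp2":
            cs_count[act[1]] += 1
        execute(s, act, nbrs)
    return cs_count, violations

triangle = [[1, 2], [0, 2], [0, 1]]        # three mutually neighboring processes
counts, violations = simulate(triangle, steps=20000, seed=1)
print(counts, violations)
```

Started from the all-zero state, the run should report no state in which two neighbors have dp2 enabled at once, and every process should enter its CS. Calling `simulate(triangle, 20000, arbitrary_start=True)` starts from a random state instead; violations may then be counted during the initial recovery phase, which is what stabilization permits.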

Due to space limitations some of the proofs of the theorems stated are omitted. 
All proofs are available in the full version of the paper [19]. 

Stabilization 

Let $P_u$ and $P_v$ be any two neighbor processes.

Proposition 1. $\mathcal{DP}$ stabilizes to the following predicate:

  there can be one and only one token in $H_{uv}$    ($R_1$)

Lemma 1. $\mathcal{DP}$ stabilizes to the following predicates:

  $((u < v) \land (a_u.v = b_v.u) \land ready_u) \Rightarrow r_v.u$    ($R_2$)

  $((u > v) \land (a_u.v = b_v.u) \land \lnot ready_u) \Rightarrow \lnot r_v.u$    ($R_3$)

Lemma 2. $\mathcal{DP}$ stabilizes to the following predicate:

  $((u > v) \land (a_u.v = b_v.u) \land \lnot ready_u) \Rightarrow \lnot y_v.u$    ($R_4$)

We now define a predicate $I_{\mathcal{DP}}$ (which stands for invariant of
$\mathcal{DP}$) such that every computation of $\mathcal{DP}$ that starts at a state
conforming to $I_{\mathcal{DP}}$ satisfies safety and liveness. $I_{\mathcal{DP}}$
is: for every pair of neighbor processes, $R_1 \land R_2 \land R_3 \land R_4$.
In other words, in every state conforming to $I_{\mathcal{DP}}$, every pair of
neighbor processes conforms to every predicate from the above list.

Theorem 1. $\mathcal{DP}$ stabilizes to $I_{\mathcal{DP}}$.

Thus every execution of $\mathcal{DP}$ eventually reaches a state conforming to
$I_{\mathcal{DP}}$. In the next two subsections we show that every computation that
starts from a state conforming to $I_{\mathcal{DP}}$ satisfies safety and liveness
properties.

Safety 

Theorem 2 (Safety). In a state conforming to $I_{\mathcal{DP}}$, no two neighbor
processes have their guarded commands that execute critical section enabled.



Liveness 

For a process $P_u$ and its neighbor $P_v$ the value of the variable $a_u.v$ is
changed only when all $a$ variables of process $P_u$ have their tokens. The
following observation can be made on the basis of Proposition 1.

Proposition 2. All $a$ variables of a process eventually get the tokens. That is,
a state conforming to $\exists v : (P_v, P_u) \in N : a_u.v \neq d_u.v$ is
eventually followed by a state where
$\forall v : (P_v, P_u) \in N : a_u.v = d_u.v$.

Lemma 3. If a process $P_u$ is in CS contention it is eventually allowed to execute
its CS.

Lemma 4. If a process $P_u$ wants to enter its CS it eventually joins CS
contention.

The following theorem unifies Lemmas 3 and 4.

Theorem 3 (Liveness). If $I_{\mathcal{DP}}$ holds, a process that wants to enter its
CS is eventually allowed to do so.

4 The Refinement 

4.1 High- Atomicity Program 

process P_i
  var x_i
*[
(h1) g_i(x_i, {x_k | (P_i, P_k) ∈ N}) → x_i := f_i(x_i, {x_k | (P_i, P_k) ∈ N})
]

Fig. 2. High-atomicity process



Each process $P_i$ of the high-atomicity program ($\mathcal{H}$) is shown in
Figure 2. To simplify the presentation we assume that $P_i$ contains only one GC.
We provide the generalization to multiple GCs later in the section. Each $P_i$ of
$\mathcal{H}$ contains a variable $x_i$ which is updated by $h1_i$. The type of
$x_i$ is arbitrary. The guard of this GC is a predicate $g_i$ that depends on the
values of $x_i$ and variables of neighbor processes. The command of $h1_i$ assigns
a new value to $x_i$. The value is supplied by a function $f_i$ which again depends
on the previous value of $x_i$ as well as on the values of the variables of the
neighbors. Recall that, unlike in a low-atomicity program such as $\mathcal{DP}$, a
GC of $\mathcal{H}$ can read any variable of the neighbor process and update its
own variable in one GC.
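
A high-atomicity program of this shape can be executed directly under interleaving semantics. In the Python sketch below, the guard and the update of one process are evaluated in a single step, as in Figure 2. The runner `run_high_atomicity`, the use of one shared `g` and `f` for all processes (the paper allows a separate $g_i$ and $f_i$ per process), and the "adopt the largest visible value" instance are assumptions made for illustration.

```python
import random

def run_high_atomicity(x, nbrs, g, f, seed=0, max_steps=10_000):
    """Interleaved execution of the single GC  g(...) -> x_i := f(...)."""
    rng = random.Random(seed)
    for _ in range(max_steps):
        enabled = [i for i in range(len(x))
                   if g(x[i], [x[k] for k in nbrs[i]])]
        if not enabled:
            return x                      # fixpoint: no guard holds
        i = rng.choice(enabled)
        # Guard evaluation and update form one atomic step: the neighbors'
        # values read here are exactly the current ones.
        x[i] = f(x[i], [x[k] for k in nbrs[i]])
    raise RuntimeError("no fixpoint within max_steps")

# Hypothetical instance: every process adopts the largest value it can see.
g = lambda xi, xs: xi < max(xs)
f = lambda xi, xs: max(xs)
line = [[1], [0, 2], [1, 3], [2]]         # a path of four processes
final = run_high_atomicity([3, 0, 7, 1], line, g, f)
print(final)
```

In a low-atomicity refinement the list comprehension over `nbrs[i]` would not be available in one step; each neighbor value would have to be copied into a private image by a separate synch GC.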

4.2 Composing $\mathcal{DP}$ and $\mathcal{H}$

To produce the refinement $\mathcal{C}$ of $\mathcal{H}$ we superpose additional
commands on the GCs of $\mathcal{DP}$ and demonstrate that $\mathcal{C}$ is
equivalent to $\mathcal{H}$. Superposition is a type of program composition that
preserves safety and liveness properties of the underlying program
($\mathcal{DP}$). $\mathcal{C}$ consists of superposition variables, superposition
commands and superposition GCs. The superposition variables are disjoint from
variables of $\mathcal{DP}$. Each superposition command has the following form:

  ⟨GC of DP⟩ ‖ ⟨command⟩

The type of the combined GC (synch or update) is the same as the type of the GC
of $\mathcal{DP}$. The superposition commands and GCs can read but cannot update the
variables of $\mathcal{DP}$. They can update the superposed variables. Operationally
speaking, a superposed command executes in parallel (synchronously) with the GC
of $\mathcal{DP}$ it is based upon, and a superposed GC executes independently
(asynchronously) of the other GCs. Refer to [7] for more details on superposition.
Superposition preserves liveness and safety properties of the underlying program.
In particular, if $R$ is stabilizing for $\mathcal{DP}$ it is also stabilizing for
$\mathcal{C}$. Thus, $I_{\mathcal{DP}}$ is also an invariant of $\mathcal{C}$.



process P_i
  par j : (P_i, P_j) ∈ N
  var
    public x_i
    private
      x_i.j,
      request_i : boolean
*[
(c1) dp1
[]
(c2) dp2 ‖ ( if g_i(x_i, {x_i.k | (P_i, P_k) ∈ N})
             then x_i := f_i(x_i, {x_i.k | (P_i, P_k) ∈ N}) fi,
             request_i := false )
[]
(c3) dp3
[]
(c4) dp4 ‖ ( if x_i.j ≠ x_j then x_i.j := x_j, request_i := true fi )
[]
(c5) x_i.j ≠ x_j → x_i.j := x_j, request_i := true
[]
(c6) g_i(x_i, {x_i.k | (P_i, P_k) ∈ N}) ∧ ¬request_i →
        request_i := true
]

Fig. 3. Refined process



Each process $P_i$ of the composed program ($\mathcal{C}$) is shown in Figure 3. For
brevity, we only list the superposed variables in the variable declaration section.
Besides $x_i$ we add $x_i.j$, which is an image of $x_j$ for every neighbor $P_j$.
Superposed variable $request_i$ is read by $\mathcal{DP}$. Yet it does not violate
the liveness and safety properties of $\mathcal{DP}$ since no assumptions about this
variable were used when these properties were proven.

The GCs of $\mathcal{DP}$ are shown in abbreviated form. We superpose the execution
of h1 on dp2. Note that c2 is an update GC. Therefore, the superposed command
cannot read the value of $x_j$ of a neighbor $P_j$ directly as h1 does. The image
$x_i.j$ is used instead. We superpose copying of the value of $x_j$ into $x_i.j$ on
dp4. Thus, the images of neighbor variables of $\mathcal{H}$ are equal to the
sources when h1 is executed by $\mathcal{C}$. We add a superposition GC c5 that
copies the value of $x_j$ into $x_i.j$. This GC ensures that no deadlock occurs when
an image is not equal to its source. $request_i$ is set when one of the images of
the superposed variables is found to be different from the sources or when the
guard of h1 evaluates to true (c6). $request_i$ is cleared after h1 is executed.

So far we assumed that $\mathcal{H}$ has only one GC. The refined program can be
extended to multiple GCs. In this case, c2 has to select one of the enabled GCs
of $\mathcal{H}$ and execute it. c6 has to be enabled when at least one of the GCs
of $\mathcal{H}$ is enabled. We prove the correctness of $\mathcal{C}$ assuming that
$\mathcal{H}$ has only one GC. In a straightforward manner, our argument can be
extended to encompass multiple GCs.

4.3 Correctness of the Refinement 

As with the correctness proof of $\mathcal{DP}$, we have to omit proofs of the
theorems stated in this section due to space limitations. All proofs are available
in the full version of the paper [19].

Throughout this section we assume that $P_u$ and $P_v$ are neighbor processes.

Lemma 5. $\mathcal{C}$ stabilizes to the following predicates:

  $((u < v) \land (a_u.v = d_u.v) \land ready_u) \Rightarrow (x_u.v = x_v)$    ($R_5$)

  $((u > v) \land \lnot r_u.v) \Rightarrow (x_u.v = x_v)$    ($R_6$)

The following corollary can be deduced from the lemma.

Corollary 1. If $I_{\mathcal{DP}}$ holds, c2 is executed only when the images of the
neighbor variables are equal to the sources. That is:

  $\forall (P_u, P_v) \in N : c2_u \Rightarrow (x_u.v = x_v)$

Lemma 6. $\mathcal{C}$ stabilizes to the following predicates:

  $((u < v) \land (a_u.v = b_v.u) \land \lnot ready_u) \Rightarrow (x_u = x_v.u)$    ($R_7$)

  $((u > v) \land (a_u.v = b_v.u) \land ready_u) \Rightarrow (x_u = x_v.u)$    ($R_8$)

We define the invariant for $\mathcal{C}$ (denoted $I_{\mathcal{C}}$) to be the
conjunction of $I_{\mathcal{DP}}$, $R_5$, $R_6$, $R_7$, and $R_8$. On the basis of
Theorem 1, Lemma 5, and Lemma 6 we can conclude:

Theorem 4. $\mathcal{C}$ stabilizes to $I_{\mathcal{C}}$.

Recall that a global state is by definition an assignment of values to all the 
variables of a concurrent program. If a program is composed of several component 
programs, then a component projection of a global state s is the part of s consisting 
of the assignment of values to the variables used only in one of the components of 
the program. Stuttering is a sequence of identical states. A component projection 
of a computation is the sequence of corresponding component projections of all the 
states of the computation with finite stuttering eliminated. Note that projection 
of a computation does not eliminate an infinite sequence of identical states. When 
we discuss a projection (of a computation or a state) of C onto H we omit H 
and just say a projection of C. A fixpoint is a state where none of the GCs of 
the program are enabled. Thus, a computation either ends in a fixpoint or it is 
infinite. 
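The definitions above can be illustrated with a minimal Python sketch. It assumes that a state is a dictionary from variable names to values and that a guard is a predicate on a state; the function names are hypothetical and only finite prefixes of computations are handled.

```python
from itertools import groupby

def project_state(state, component_vars):
    """Keep only the variables that belong to one component."""
    return {name: value for name, value in state.items() if name in component_vars}

def project_computation(states, component_vars):
    """Project every state, then collapse each finite run of identical projections."""
    projected = [project_state(s, component_vars) for s in states]
    # groupby compares consecutive items with ==, so each run yields one key.
    return [key for key, _run in groupby(projected)]

def is_fixpoint(state, guards):
    """A fixpoint is a state in which no guard is enabled."""
    return not any(guard(state) for guard in guards)
```

Steps that change only the variables of the other component disappear from the projection as stuttering, which is why several steps of C may correspond to a single state of H.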

Proposition 3. Let s be a fixpoint of C. The following is true in s: 

— $a_{u.v} = b_{v.u} = c_{v.u} = d_{u.v}$; 

— $r_{u.v} = \mathit{ready}_v$; 

— $\mathit{ready}_u$ is cleared; 

— if $u < v$ then $y_{u.v}$ is cleared; 

— $x_{u.v} = x_v$. 



Theorem 5 (Fixpoint preservation). When $I_C$ holds, a projection of a fixpoint 
of C is a fixpoint of H; and if a computation of C starts from a state whose 
projection is a fixpoint of H, then this computation ends in a fixpoint. 

Let $\sigma_C$ and $\sigma_H$ be computations of C and H respectively. 

Lemma 7. If $I_C$ holds and $h1_u$ is continually enabled in the projection of $\sigma_C$, 
then $h1_u$ is eventually executed in $\sigma_C$. 

Theorem 6 (Soundness). If a computation $\sigma_C$ of C starts at a state where $I_C$ 
holds, then the projection $\sigma_H$ of $\sigma_C$ is a computation of H. 

We call a state s of C clean if for any process $P_u$, $\mathit{ready}_u$ is cleared and the 
only guard that is possibly enabled at s is $c6_u$. Let $u < v$. In a clean state only 
$c6_u$ can be enabled in $P_u$. Thus the following should also be true in every clean state: 

— the token is held by that is: $a_{u.v} = b_{v.u} = c_{v.u} = d_{u.v}$; 

— since $\mathit{ready}_v$ is cleared, $r_{u.v}$ and $y_{u.v}$ are also cleared; 

— $\mathit{request}_u$ is cleared. 



Theorem 7 (Completeness). For every computation $\sigma_H$ there exists a 
computation $\sigma_C$ the projection of which is $\sigma_H$. 

The completeness proof is based on the idea that for every pair of consecutive 
states $s_i$ and $s_{i+1}$ of $\sigma_H$ we can find a clean state $s'_i$ of C whose projection 
is $s_i$ and a sequence of states that leads to a clean state whose projection is 
$s_{i+1}$. We compose such sequences into $\sigma_C$ and show that the execution thus 
created is maximal and fair. 

5 Extensions and Concluding Remarks 

In this paper we presented a technique for stabilization-preserving atomicity 
refinement of concurrent programs. The refinement enables the design of stabilizing 
programs in a simple but restrictive model and their implementation in a more 
complex but efficient model. Our refinement is based on a stabilizing, bounded-space 
dining philosophers program in the more complex model. It is sound and complete, 
and fixpoint- and fairness-preserving. 

In conclusion, we discuss three notable extensions of our refinement. 

5.1 Semantics Refinement 

Consider the semantics refinement problem where the abstract model employs 
interleaving semantics and the concrete model employs power-set semantics. We 
show how our refinement can be used to solve this problem. To demonstrate 
that our atomicity refinement is applicable to semantics refinement for power- 
set semantics we show that for a low-atomicity program a power-set computation 
is equivalent to an interleaving computation. 

Two computations are equivalent if in both computations every $P_u$ executes 
the same sequence of GCs and, when a GC executes, the values of the variables it 
reads are the same. Recall that in a computation under interleaving semantics 
(interleaving computation) each subsequent state is produced by the execution 
of one of the GCs that is enabled in the preceding state. In a computation under 
power-set semantics (power-set computation) each subsequent state is produced 
by the execution of any number of GCs that are enabled in the preceding state. 

Theorem 8. For every power-set computation of a low-atomicity program there 
is an equivalent interleaving computation. 

Proof: To prove the theorem it is sufficient to demonstrate that for every pair 
of consecutive states $(s_1, s_2)$ of a power-set computation there is an equivalent 
sequence of states of an interleaving computation. The GCs executed in $s_1$ are 
either synchs or updates. Clearly, if the synchs are executed one after another 
and followed by the updates, the resulting interleaving sequence is equivalent to 
the pair $(s_1, s_2)$. □ 
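The proof idea can be checked on a toy instance. The Python sketch below is illustrative only and rests on assumptions that are not spelled out in this section: a synch copies a neighbor's variable into a local image, an update reads only the process's own variable and its images, and each process executes at most one GC per power-set step. All variable and function names are hypothetical.

```python
def power_set_step(state, commands):
    """All commands read the same old state; their writes land together."""
    writes = {}
    for _kind, command in commands:
        writes.update(command(state))
    return {**state, **writes}

def interleaved_step(state, commands):
    """Run the synchs one after another, then the updates, one GC at a time."""
    ordered = sorted(commands, key=lambda c: c[0] != "synch")  # stable: synchs first
    for _kind, command in ordered:
        state = {**state, **command(state)}
    return state

# P_u refreshes its image of x_v while P_v updates x_v from its own image of x_u.
commands = [
    ("update", lambda s: {"x_v": s["x_v"] + s["img_v_u"]}),
    ("synch", lambda s: {"img_u_v": s["x_v"]}),
]
start = {"x_u": 1, "x_v": 2, "img_u_v": 0, "img_v_u": 5}
```

In this instance the order matters: if the update ran before the synch, the synch would copy the new value of `x_v` rather than the one it reads in the power-set step, so the two computations would no longer be equivalent.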



5.2 Generalization to Drinking Philosophers Problem 

Our refinement solution can be generalized to the drinking philosophers problem [6] 
to further increase the concurrency of the computation of the program. In the 
argument below we assume that H has multiple GCs. GCs of H conflict (affect each 
other) if one of them writes (updates) the variables the other GC reads. DP 
enforces MX of execution of GCs of H among neighbor processes. This is done 
regardless of whether these GCs actually conflict. 

In DP, to ensure MX among neighbor processes, every pair of neighbors $P_u$ 
and $P_v$ maintains a sequence of handshake variables $H_{uv}$. Sending a token along 
this handshake sequence is used to inform the neighbor whether the process is 
entering or exiting its CS. In a similar manner, $P_u$ and $P_v$ can have a sequence of 
handshake variables for every pair of conflicting guarded commands. Then, if 
a GC of H gets enabled, the tokens are sent along each sequence to prevent the 
conflicting guarded commands from executing concurrently. In the meantime 
non-conflicting GCs can execute concurrently. 

5.3 Extension to Message-Passing Systems 

Our refinement can be further extended to the message-passing model where the 
processes communicate via finite-capacity lossy channels. To do so, the underlying 
DP has to be modified as follows so that it works in this model. 

The sequence of handshake variables $H_{uv}$ between a pair of neighbors $P_u$ 
and $P_v$ is used in DP for process $P_u$ to pass some information to $P_v$ and get 
an acknowledgment that this information has been received. In message-passing 
systems an alternating-bit protocol (ABP) can be used for the same purpose. A 
formal model of dealing with lossy bounded channels in message-passing systems 
as well as a stabilizing ABP is presented in [14]. 

In this case $P_u$ sends the value of a handshake variable (together with the rest 
of its state) to its neighbor in a message. If the message is lost it is retransmitted 
upon a timeout. When $P_v$ receives the message it copies the state of $P_u$ (including 
the handshake variable) into its image variables and sends a reply back to $P_u$. 
When $P_u$ gets the reply it knows that $P_v$ got the original message. It has been 
proven that the ABP stabilizes when the range of the handshake variables is 
greater than the sum of the capacities of the channel from $P_u$ to $P_v$ and of the 
channel in the opposite direction [14]. 
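The send, timeout, and reply cycle described above can be sketched in Python. This is a simplified illustration, not the stabilizing ABP of [14]: the lossy channel is modelled by a hypothetical drop schedule (an iterator of booleans), the handshake value is a single tag, and the range-versus-capacity condition is not modelled.

```python
def deliver(message, drops):
    """Hypothetical lossy channel: drop the message if the next schedule entry is True."""
    dropped = next(drops, False)
    return None if dropped else message

def handshake(payload, tag, drops, max_tries=10):
    """P_u retransmits (tag, payload) until P_v's reply carrying the same tag arrives."""
    image = None  # P_v's image of P_u's state
    for attempt in range(1, max_tries + 1):
        msg = deliver((tag, payload), drops)   # P_u -> P_v
        if msg is None:
            continue                           # timeout: retransmit
        image = msg[1]                         # P_v copies P_u's state into its image
        reply = deliver(msg[0], drops)         # P_v -> P_u
        if reply == tag:
            return image, attempt              # P_u knows P_v got the message
    return image, None
```

With the schedule `iter([True, False, True, False, False])` the first message and the first reply are lost, so the exchange completes on the third attempt.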

When H reaches a fixpoint, the values of the variables of the processes of C 
extended to the message-passing system do not change. Thus C is in a quiescent state. 
It is well known [13] that a stabilizing message-passing program cannot reach a 
fixpoint. Therefore the extension of DP to message-passing systems is no longer 
fixpoint-preserving: the timeout has to be executed even if the projection of the 
program has reached a fixpoint. 

Abstract. In this paper we suggest the notion of self-testing/correcting 
protocols. The work initiates the merge of distributed computing and the 
area of “program checking” introduced by Blum, and specifically employs 
extended notions from the work of Blum, Luby and Rubinfeld. In this 
setting, given a protocol P (a collection of programs on a network of n 
processors) which allegedly implements a distributed function f, a 
self-tester for f is a (simpler) protocol which makes calls to P to estimate 
the probability that P, when executed in a given environment, is faulty 
(i.e., P and f differ in some of the outputs). A self-correcting protocol 
is another protocol which allows for the computation of f correctly on 
every input (with high probability) as long as P in the same type of 
environment is not too faulty. 

We first consider self-testing/correcting under a basic form of 
environmental malfunction, that of crash failures, and design a 
self-tester/corrector pair for protocols implementing the agreement “function.” Many 
distributed protocols can be designed “on top” of this primitive, and 
can be self-tested/corrected whenever it can be. We then consider 
self-testing/correcting under gossiping failures, and present a generic 
self-testing/correcting pair that is privacy-preserving. The notion is basic in 
protocols where secrecy is an issue. A self-corrector for P is 
privacy-preserving if it is private (with overwhelming probability) whenever P is 
private (with overwhelming probability). 

In the process of our study, we identify the basic components of a protocol 
self-testing “utility library,” which allows for the safe bootstrapping of 
the self-testing/correcting process. 



1 Introduction 

Traditional program testing involves running the program on a “random” subset 
of the possible inputs, for which the correct outputs are already known. There are 
several drawbacks to this approach. First, it is not clear what “random” means 
in this context (equally-likely inputs?; a distribution that arises in practice?; 
is this distribution well defined or known at all?; etc.). Both Kannan [20] and 
Lipton [23] point to the extreme case of a program that is wrong on exactly 
one input. Such a program is extremely likely to pass the testing stage, and thus 
when the program is run on the bad instance the program’s output is accepted as 
correct. The second problem concerns the requirement that some of the outputs 
be known. How is this achieved? In some cases instances are generated that are 
“small” enough to be checked by hand, which is insufficient to make an 
assertion about the general behavior of the program. In other cases a “different” 
program is used for the same computation. How independent is this program 
from the one being tested? How was the confidence about the correctness of this 
program itself obtained? 

Attempting to cover some of the above criticism, Blum [4] proposed a new 
framework called program result checking (program checking, for short). Program 
checking involves writing a result checker program to be run in conjunction with 
the original program, that checks the work of the program. Program checking is 
instance specific: each time the program is run on an input, the checker gives us 
confidence that the program is correct on that input. I.e., the program checker 
does not verify whether the program is correct or buggy as a whole; it verifies 
whether the program gives the correct answer on the particular inputs on which 
it is invoked. Since checking is done during each run of the program, it reme- 
dies the program verification problem in that the actual running version — i.e., 
the compiled version and the hardware on which it runs — of the program is 
evaluated. 

Building on program checking, Blum, Luby and Rubinfeld [11] introduced the 
notion of self-testing/correcting programs. A self-tester is an (oracle) program that 
makes calls to the program P to be evaluated to estimate the probability that 
$f(x) = P(x)$ for random x. Although a program checker can be used to verify 
whether P is correct on particular inputs, it does not provide a method for 
computing the correct answer when the program is faulty. A self-corrector corrects P 
as long as P’s probability of error, estimated by the self-tester, is sufficiently low. 
Blum, Luby and Rubinfeld develop general techniques for constructing 
simple-to-program self-testing/correcting pairs for a variety of numerical problems [11]. 

Independently, Lipton [23] also discussed the concept of self-correcting 
programs and used it to derive a distribution-free theory of random testing; for 
several problems he constructed testing programs with respect to any distribution, 
assuming that the programs are not too faulty with respect to a particular 
distribution. Instrumental in both developments above is the random self-reducibility 
property. Informally, a function f is random self-reducible if, for any x, the 
computation of $f(x)$ can be reduced to the computation of f on other “randomly 
chosen” inputs. 
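As a hedged illustration of random self-reducibility in the sequential setting, the Python sketch below self-corrects a program for a linear function over the integers modulo M, using the identity f(x) = f(r) + f(x - r) and a majority vote. The modulus, the function f, the faulty program P, and all names are assumptions made for this example; they are not taken from [11] or [23].

```python
import random
from collections import Counter

M = 1000  # hypothetical group: integers modulo M

def f(x):
    """Reference function, used here only to build and check the example."""
    return (3 * x) % M

def P(x):
    """Hypothetical faulty program: wrong on exactly one input."""
    return 0 if x == 7 else f(x)

def self_correct(program, x, shifts):
    """Evaluate program(r) + program(x - r) for each shift r and take the majority."""
    votes = Counter((program(r) + program((x - r) % M)) % M for r in shifts)
    return votes.most_common(1)[0][0]

def random_shifts(k, seed=0):
    """Draw k shifts uniformly from the group."""
    rng = random.Random(seed)
    return [rng.randrange(M) for _ in range(k)]
```

Calling `self_correct(P, 7, random_shifts(15))` queries P only at random points, so the single bad input 7 is hit by a given vote only when r is 0 or 7, and the majority recovers f(7) with high probability.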

Protocol self-testing. In this paper we extend the Blum methodology to deal with 
distributed protocols. Such protocols should be understood here as collections of 
programs running in a network of n processors, one for each processor. 
Motivated by [11], we introduce the notion of self-testing/correcting protocols. Given 
a protocol P which allegedly implements a distributed function f, a self-tester 
for f is a protocol which makes calls to P to estimate the probability that P is 
faulty (i.e., P and f differ in some of the outputs). A self-correcting protocol is 
another protocol which allows for the computation of f correctly on every input 
(with high probability) as long as P is not too faulty. 

In such an environment, other things can go wrong (e.g., processor failures, 
message delays, messages being lost, etc.) besides the faulty behavior 
due to flaws in the design and coding of the protocol, making the task of 
evaluating the correctness of a protocol more complicated than in the case of a 
sequential program. In particular, the benefit of evaluating the running version 
of a protocol and the “generalized hardware” on which it runs becomes all the 
more relevant in this context. 

As in [11], we require that the self-tester be different from any protocol 
implementing the function under consideration, as well as simple and efficient (see 
Section 2). The first requirement guarantees the independence of the verification 
step, the second (however difficult to quantify) makes it easier to assess its 
correctness, while the third ensures that the cost of using it (e.g., a 
big slowdown in the running time) is not so high as to overwhelm the benefits of using it. 

We first consider (Section 3) self-testing/correcting under the basic form of 
environmental malfunction of crash failures, and design a self-tester/corrector 
pair for the agreement “function” (also known as Byzantine agreement or 
Distributed Consensus [24]). 

In [11,25], Blum et al. discuss the notion of using several programs (a 
library), all of which are possibly faulty, to aid in the testing and correcting. In 
the process of designing our self-tester/corrector pair for agreement, we identify 
some of the basic components (co-routines) that a protocol self-testing 
“utility” library should contain; besides agreement, these include a shared random 
coin (e.g., [13]) and Crusader agreement [14]. Such a protocol library, together 
with the program library of [11], enables the bootstrapping of self-testing/correcting 
for other protocols. 

One important class of protocols are those which enable secure multi-party 
computation [9,12]. Besides the correctness requirement, such protocols are 
required to be t-private; informally, this means that an adversary viewing the 
transcripts of any t parties learns nothing about the computation beyond what 
can be inferred from their inputs together with the result. From a security 
standpoint, it is important that self-testers and self-correctors do not create 
new vulnerabilities. We also present in this paper (Section 4) a generic 
self-testing/correcting pair that is privacy-preserving: if a protocol is t-private, then 
so is its self-testing/correcting pair. 



Related work. To the best of our knowledge, this is the first time that the self- 
testing/correcting of protocols is discussed. 

Program checking and self-testing have been extensively investigated, and 
checkers and self-testers are now known for many problems [10,2,19,26,27,17]. 
Other relevant works include checking the correctness of memory [6], 
cryptographic program checking [18], and batch checkers [8]. 

In the related but different area of self-stabilizing protocols (e.g., [1,3,15,22]) 
the notion of “checking” has been considered as an important co-routine. This 
setting assumes a “correct protocol control” constantly checking the state of 
the execution, which stabilizes when faults cease to exist. The self-stabilizing 
approach as a whole does not propose a “tool” for reliability; rather, it is a 
“protocol design methodology” for certain faulty environments. Our setting is quite 
different: we apply our approach as a reliability tool which checks and actively 
corrects outputs in the presence of faulty control and executional environment. 

2 Model and Definitions 

The model we consider in this paper is a synchronous, fully connected network 
of n processors/parties with identities from $\{1, 2, \ldots, n\}$. 

The literature on checking calls the problem specification a “function” while 
the implementation is called a “program.” In a distributed environment, a func- 
tion may admit more than one outcome depending on the environment’s “input.” 
This is a behavior which can be specified using relations; however, for consis- 
tency with the existing theory of sequential function self-testing/correcting, we 
continue to model this process using functions, as explained in the following. 

Distributed functions and protocols. To model the computation we define an 
ideal distributed function $\phi : V^n \to R^n$ for arbitrary domain V and range R; 
typically, V = R. A prime example of such a function is the addition function 
where the inputs are values from a given group and the result is the value of the 
sum of the inputs. Many arithmetic functions like addition have been treated in 
the sequential case of self-testing/correcting programs. 

In a distributed environment, the behavior of individual processors and the 
communication environment may deviate from an ideal execution. This gives 
rise to specifying problems under certain assumptions on (partial properties of) 
the “behavior.” Many times we specify that depending on this behavior the 
computation will take into account only a partial portion of the inputs, or we 
specify that only the results which are computed by the non-faulty processors 
be considered. For example, a “distributed addition” in a faulty environment 
will specify that we sum up only the values of processors that are non-faulty 
(up to a certain time), and that the summation result will be written by all the 
non-faulty processors. 

We model this process through the notion of an environment-aware 
distributed function $f : W^n \to S^n$, where f corresponds to $\phi$, and they agree 
on ideal environments. W is an extended set of inputs which includes the actual 
input set V, but also relevant environmental characteristics that tag each 
processor; these characteristics influence the outcome of the function. S extends R 
in a similar fashion. A protocol for a distributed function $\phi$ is a distributed 
program which implements f. 

We use $\mathcal{E}$ to denote the set of environments where the protocols run. 
Examples of sets of environments are adversarial settings where processors may 
fail; settings where messages may be lost or delayed; etc. We use the expression 
“running in $\mathcal{E}$” as a shorthand for “running in environments from $\mathcal{E}$.” 
$\mathcal{E}$ implies the relevant environmental characteristics in the specification of 
the extended input/output pair W/S, though $\mathcal{E}$ may have additional elements 
and events. For example, we may only be interested in the case of when a processor 
has failed (e.g., before or after a certain round) in order to characterize the set 
of relevant inputs; we may be interested only in whether a processor has failed or 
not in order to specify a condition on its output; we may not be interested in what 
a failed processor has written in its internal state; etc. 

Oracle protocols. We will now discuss a notion of protocol extension where one 
protocol employs another protocol as an oracle (I/O access only). This is anal- 
ogous to the formalization of program extension in [11]. 

Definition 1 (Oracle protocol). A protocol V is an oracle protocol if it makes 
calls to another protocol P that is specified at run time. When a call is made 
to P, each party learns only its own output, and not any intermediate messages 
or computations of P (i.e., the call is “black-box”). We let $V^P$ denote protocol V 
making calls to protocol P. 

We let $D_V$ denote the probability distribution according to which the 
inputs $v_i$, $1 \le i \le n$, are drawn from V. If P is a protocol, then we denote by $P[i]$ 
the output of P on processor i. Let $\mathrm{error}(f, P, D_V, \mathcal{E})$ denote the probability 
that $f \ne P$ (i.e., $f[i] \ne P[i]$ for at least some i) on arguments randomly chosen 
according to probability distribution $D_V$ running in $\mathcal{E}$. Let $\beta$ be a confidence 
parameter. 
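To make the error measure concrete, here is a small Python sketch that estimates it empirically from sampled input vectors. It is an illustration under simplifying assumptions: the environment is not modelled, protocols are plain functions from an input vector to a vector of per-processor outputs, and the names are hypothetical.

```python
def differs(f_out, p_out):
    """f and P differ on a run if their outputs differ on at least one processor."""
    return any(a != b for a, b in zip(f_out, p_out))

def empirical_error(f, P, sample_inputs):
    """Fraction of sampled input vectors on which P deviates from f."""
    runs = [differs(f(v), P(v)) for v in sample_inputs]
    return sum(runs) / len(runs)

def add_ideal(v):
    """Ideal distributed addition: every processor outputs the sum of all inputs."""
    return [sum(v)] * len(v)

def add_buggy(v):
    """Hypothetical faulty protocol: processor 0 is off by one when its input is 0."""
    out = add_ideal(v)
    if v[0] == 0:
        out[0] += 1
    return out
```

A run counts as an error even when a single processor deviates, which matches the "at least some i" clause in the definition.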

Definition 2 (Self-testing protocol). Let $0 \le \epsilon_1 < \epsilon_2 \le 1$. An 
$(\epsilon_1, \epsilon_2)$-self-testing protocol for f with respect to $D_V$ in $\mathcal{E}$ is a 
probabilistic oracle protocol $T_f$ that for any P running in $\mathcal{E}$ and $\beta$: 

1. If $\mathrm{error}(f, P, D_V, \mathcal{E}) \le \epsilon_1$ then $\Pr(\forall i : T_f^P[i] = \text{“PASS”}) \ge 1 - \beta$; 

2. If $\mathrm{error}(f, P, D_V, \mathcal{E}) \ge \epsilon_2$ then $\Pr(\forall i : T_f^P[i] = \text{“FAIL”}) \ge 1 - \beta$. 

Definition 3 (Self-correcting protocol). Let $0 \le \epsilon < 1$. An $\epsilon$-self-correcting 
protocol for f with respect to $D_V$ in $\mathcal{E}$ is a probabilistic oracle protocol $C_f$ that 
for any P running in $\mathcal{E}$ and $\beta$: 

If $\mathrm{error}(f, P, D_V, \mathcal{E}) \le \epsilon$ then $\Pr(\forall i : C_f^P[i] = f[i]) \ge 1 - \beta$. 

A self-testing/correcting pair for f in $\mathcal{E}$ is a pair of probabilistic oracle 
protocols $(T_f, C_f)$ such that there are constants $0 \le \epsilon_1 < \epsilon_2 \le \epsilon < 1$ such that $T_f$ is 
an $(\epsilon_1, \epsilon_2)$-self-testing protocol with respect to $D_V$, and $C_f$ is an $\epsilon$-self-correcting 
protocol with respect to $D_V$. 

As in [11], $\epsilon_1 = 0$ is sufficient for assuming that f will be computed correctly. 
On the other hand, $\epsilon_2$ should be as close to $\epsilon_1$ as possible in order to allow 
protocols that are as “buggy” as possible, and still have the corrector work 
properly. 






Matthew Franklin et al. 



Modes of operation. In order to allow for self-testing/correcting, the oracle pro- 
tocol is invoked a number of times. We will consider two modes in which the 
oracle protocols run: 

— The “reset” mode: The oracle calls are made sequentially and the system 
is reset with every call. Namely, the system starts from an initial state which 
is presented to the oracle. To allow the processors to make conclusions based 
on multiple oracle calls, we will assume that each processor has non-volatile 
memory preserving information across calls. 

— The “parallel” mode: The oracle calls are made in parallel. Namely, the 
processors may have a pre-processing stage after which they invoke the 
oracle, running in parallel a number of executions. 

Complexity: the “little-oh” constraint. As in [11], we require that the 
self-tester/corrector be quantifiably different from any correct protocol that 
implements f, in the sense that its complexity, not counting the cost of the calls 
to P, should be lower than the complexity of any correct protocol for f. In this 
paper we consider synchronous networks, where the computation proceeds as 
a series of rounds. Thus, we require that the running time (number of rounds) 
of $T_f$ (resp., $C_f$) be $o(R)$, where R is the minimum worst-case running time of 
protocols that compute f, when calls to P are counted as one round. In fact, the 
testers and correctors we present in this paper satisfy a stricter requirement, in 
that their running time is within a small multiplicative constant of the running 
time of P, when including the running time of P. 

A similar notion can be applied to communication (e.g., number of messages). 

Property-preserving oracle protocols. When a protocol runs in $\mathcal{E}$ it may satisfy 
certain properties. For example, it may tolerate a given number of failures of a 
certain type, or it may be “private” with respect to an adversary who controls 
the history of the execution and the view of the failing (gossiping) processors. 

We would like to claim that a self-testing/correcting protocol, which employs 
the protocol being tested/corrected as an oracle, satisfies the same property. This 
notion is captured by the following definition: 

Definition 4. A self-tester $T_f$ (resp., self-corrector $C_f$) for P is $\rho$-preserving 
if $T_f^P$ (resp., $C_f^P$) satisfies property $\rho$ (with overwhelming probability) whenever P 
satisfies property $\rho$ (with overwhelming probability). 

3 Self-Testing/Correcting under Environmental Malfunction 

In this section we consider environments where some of the processors in the 
network, up to t, may fail by crashing. The semantics of crash failures is the 
usual one: when a processor stops sending messages in a round, only a subset of the 
processors receive its message of that round. Initially, we assume that the failing 
processors are chosen uniformly at random among all subsets of up to t processors. 
This is the only random choice assumption that we make. We will denote these 
environments by $\mathcal{C}$. Later on we consider the standard crash failure model. 

3.1 The Agreement Function 

We now consider the design of self-tester/corrector pairs for protocols 
implementing the agreement function (BA) [24]. Recall that in this problem a 
distinguished source s has an initial value $v_s \in V$. It is required that upon 
termination, all the correct processors output the same value, and if the source is 
non-faulty these output values coincide with the source’s input value, $v_s$. The 
problem can be defined as a function as follows: Let V denote the set of input and 
output values, let B = {faulty, nonfaulty} be the set of possible processor behavior 
patterns, and let T = {true, false} be the set of possible termination conditions. 
Then the agreement function $BA : (V \times B)^n \to (V \times T)^n$ is such that: 

$$\forall i \text{ s.t. } b_i = \text{nonfaulty}, \quad (d_i, t_i) = \begin{cases} (v_s, \text{true}) & \text{if } b_s = \text{nonfaulty;} \\ (v, \text{true}), \text{ where } v \in V, & \text{otherwise.} \end{cases}$$

The first condition is usually called the Validity condition, while the second is 
called the Agreement condition. The behavior components are given as an 
example; a more refined notion of behavior may specify in which round a processor 
fails, which subset of the remaining processors receives its message when it fails, etc. 

We will deal with protocols that implement the agreement function, and their 
correctness. It is known that every BA protocol in the presence of crash failures 
requires at least t+1 rounds of communication in its worst-case run (e.g., [16]). Even in 
benign adversarial models, like the one considered in this section, this worst-case 
scenario may happen with a very small probability. 

In order to satisfy the “little-oh” constraint, in [11] self-testing/correcting 
programs are designed with code that is simpler to implement, and to verify, than 
the function being self-tested/corrected. In our case, the simple one-round 
Crusader agreement protocol [14] will play such a role. Recall that in this problem, 
the sender also has an initial value which it wishes to broadcast to the remaining 
processors. The goal is a protocol that guarantees the same Validity condition as 
above, but the Agreement condition requires that no two nonfaulty processors 
decide on different values unless at least one of them discovers that the sender 
is faulty. The definition of the CA function is similar to the one above, and 
we omit it for brevity. 

To implement the above function, all the sender has to do is to 
broadcast its input; every processor can then either decide on the sender’s 
value or detect that the sender has crashed, in which case it is allowed to decide 
on a default value. That is, the problem is solvable in just one round. 
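The one-round Crusader agreement described in Section 3.1 can be sketched in Python as follows. The sketch assumes crash failures only, represents the default value by `None`, and models a sender crash by the set of processors its broadcast still reaches; the function names are hypothetical.

```python
BOTTOM = None  # stands in for the default decision value

def crusader_round(sender_value, n, reached=None):
    """One round of CA under crash failures.

    reached is the set of processors that receive the sender's message;
    None means the sender did not crash, so every processor receives it.
    """
    if reached is None:
        reached = set(range(n))
    # A processor that hears nothing detects the crash and decides the default.
    return [sender_value if i in reached else BOTTOM for i in range(n)]

def ca_agreement_holds(decisions):
    """No two processors decide different values unless one of them saw the crash."""
    values = {d for d in decisions if d is not BOTTOM}
    return len(values) <= 1
```

Under crash failures a processor either receives the sender's true value or nothing at all, which is why a single broadcast round suffices.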



3.2 Self-Tester for Agreement: Reset Mode 

Consider the simple oracle protocol depicted in Figure 1 for assessing one run 
of a BA protocol P. (We assume that P uses a default decision value $\bot \in V$.) 
The tester consists of an initial CA call, followed by a call to P on a random 
input drawn according to $D_V$, followed by another CA. In the first CA the sender 
distributes its randomly chosen input. If the sender does not crash in this round, 
then all the non-faulty processors learn the sender’s value; on the other hand, if 
the sender crashes, every processor knows either the input or the fact that the 
sender has crashed, and hence the output of the oracle call must be the default 
value. In the second CA, every processor i (including the sender) distributes 
$P[i]$, the outcome obtained by running protocol P, to all other processors. Note 
that any processor participating in the second CA was not faulty during the call 
to P. Based on its local view of input and outputs, every processor judges the 
protocol by deciding on either “fail” or “not fail.” It is easy to check that the 
tests performed cover all the possibilities of protocol malfunction. 



Protocol Simple_Agreement_Tester(P, s) 

    s chooses $v \in_{D_V} V$ 
    input ← CA(s, v) 
    P(s, v) 
    for i = 1, ..., n 
        output_i ← CA(i, P[i]) 
    let G = {i : message received in CA(i, P[i])} 
    if message received in CA(s, P[s]) 
        sender-alive ← true 
    fail ← (∃ j, k ∈ G : output_j ≠ output_k) ∨ 
           (∃ j ∈ G : (output_j ≠ input ∧ sender-alive) ∨ 
                      (output_j ≠ input ∧ output_j ≠ ⊥)) 

Fig. 1. Simple agreement tester for assessing one run; code for proc. i 
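The local judgment made at the end of Fig. 1 can be sketched in Python. This sketch assumes one particular reading of the figure's predicate: a run fails if two received outputs disagree, or if some output differs from the input while the sender is alive, or if some output is neither the input nor the default value. The function name, the dictionary representation of the received outputs, and the use of `None` for the default value are assumptions of the example.

```python
BOTTOM = None  # stands in for the default decision value

def judge(input_value, sender_alive, outputs):
    """Return True ("fail") or False ("not fail") from one processor's local view.

    outputs maps each processor j whose message arrived in the second CA
    to the value P[j] it distributed.
    """
    values = list(outputs.values())
    # Agreement check: any two received outputs must coincide.
    disagreement = any(a != b for a in values for b in values)
    # Validity check: with a live sender the output must equal the input;
    # otherwise it must be either the input or the default value.
    invalid = any(
        (out != input_value and sender_alive)
        or (out != input_value and out is not BOTTOM)
        for out in values
    )
    return disagreement or invalid
```

For instance, `judge(5, False, {0: None, 1: None})` is "not fail": after a sender crash, a unanimous default decision is a correct run.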



Assume that the number of failures that may occur in $\mathcal{C}$ is at most $t < n/2$. 
We can state the following about the tester of Fig. 1: 

Claim 1 If the run of P is correct, then “not fail” holds in all nonfaulty 
processors. 

Claim 2 If the run of P is incorrect, then “fail” holds in every nonfaulty 
processor with probability $\delta > 1/2$. 

In the worst case, the misbehavior of P is witnessed by just one processor, and 
its probability of failing (and thus not communicating its findings to any other 
processor) is less than 1/2. 

Protocol Agreement_Tester(P, ε, β) 

    N ← (32/ε) ln(2n/β) 
    for j = 1, ..., N 
    {   choose sender s_j 
        Simple_Agreement_Tester(P, s_j) 
        if fail 
            S ← S + 1 
    } 
    if S/N > ε/4 
        “FAIL” 
    else “PASS” 

Fig. 2. Generic self-tester for agreement; code for proc. i 



We will now use the simple tester of Fig. 1 as the basis for a generic 
self-tester for BA. The self-tester is shown in Figure 2; it is presented in a “generic” 
form ($\epsilon$ as an input parameter). The protocol basically runs the simple tester 
an adequate number of times in order to guarantee similar conclusions on the 
nonfaulty processors. In the reset mode, processors that failed are rebooted each 
time. The way the sender $s_j$ is chosen in each run is left unspecified. In 
environments from $\mathcal{C}$, the same sender can be used for all calls; however, the ability to 
choose senders in different ways will allow us to cope with other types of network 
adversaries (e.g., asymmetric). 
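The repetition scheme of the generic self-tester can be sketched in Python. Two values here are assumptions inferred from the proof of Theorem 1 rather than read directly off the figure: the number of runs N = (32/ε) ln(2n/β) and the decision threshold ε/4 on the fraction of failed runs. The callable `run_simple_tester` is a hypothetical stand-in for one invocation of the simple tester on this processor.

```python
import math

def num_runs(epsilon, beta, n):
    """Number of simple-tester runs, assuming N = (32 / epsilon) * ln(2n / beta)."""
    return math.ceil(32 / epsilon * math.log(2 * n / beta))

def agreement_tester(run_simple_tester, epsilon, beta, n):
    """Repeat the simple tester N times and compare the failure rate to epsilon / 4."""
    total = num_runs(epsilon, beta, n)
    fails = sum(1 for _ in range(total) if run_simple_tester())
    return "FAIL" if fails / total > epsilon / 4 else "PASS"
```

With ε = 0.5, β = 0.1, and n = 4 this gives 281 runs, which shows that the cost grows only logarithmically in n and 1/β but linearly in 1/ε.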

Now to the analysis. Let $Y_l[i]$, $1 \le l \le N$, be the 0/1 random variable 
indicating the event fail = true in the l-th call on processor i. From Claim 2, 

$$\mu_i = E[Y_l[i]] = \delta \cdot \mathrm{error}(BA, P, D_V, \mathcal{C}).$$

The following lemma quantifies the number of runs after which the assessments 
of all the processors are sufficiently accurate. 

Lemma 1. Let $\gamma \le 2$ and $N \ge \frac{1}{\mu_i} \cdot \frac{4 \ln(2n/\beta)}{\gamma^2}$. Then 

$$\Pr(\forall i : (1 - \gamma)\mu_i \le Y[i] \le (1 + \gamma)\mu_i) \ge 1 - \beta,$$

where $Y[i] = \frac{1}{N} \sum_{l=1}^{N} Y_l[i]$. 

The proof of the lemma follows from a slight variation of the “Zero-One 
Estimator Theorem” of [21] for identically-distributed 0/1-valued random variables 
with mean $\mu$, and the linearity of expectation. Similarly to [25], we will make 
use of the following corollary: 

Corollary 1. (1) Let $\mu' \le \mu_i$, for all i, and let $N = \frac{1}{\mu'} \cdot 16 \ln(2n/\beta)$. Then 
$\Pr(\exists i : Y[i] < \mu'/2) \le \beta$. (Use $\gamma = 1/2$.) 

(2) Let $\mu'' \ge \mu_i$, for all i, and let $N = \frac{1}{\mu''} \cdot 4 \ln(2n/\beta)$. Then 
$\Pr(\exists i : Y[i] > 2\mu'') \le \beta$. (Use $\gamma = 1$.) 

Theorem 1. The protocol of Figure 2 is an $(\varepsilon/8, \varepsilon)$-self-tester for BA.

Proof (Sketch). The “little-oh” requirement follows from the fact that CA requires
$O(1)$ rounds. Regarding correctness, first note that, from Claim 2, the
expected output on processor i is $E[Y[i]] = \mu_i = \delta \cdot \mathrm{error}$, $1/2 < \delta \le 1$. There are
two cases to consider:

(error $\ge \varepsilon$) Let $\mu' = \varepsilon/2$ (taking the lower bound on $\delta$) and $N = \frac{1}{\varepsilon} \cdot 32 \ln(2n/\beta)$.
Corollary 1(1) yields $\Pr(\exists i : Y[i] < \varepsilon/4) \le \beta$. On the other hand, in the
protocol of Fig. 2 processor i outputs “FAIL” if $Y[i] > \varepsilon/4$. Hence, if error $\ge \varepsilon$,
the protocol outputs “FAIL” on every processor with probability at least
$1 - \beta$.

(error $\le \varepsilon/8$) Let $\mu'' = \varepsilon/8$ (corresponding to $\delta = 1$) and $N = \frac{1}{\varepsilon} \cdot 32 \ln(2n/\beta)$.
Corollary 1(2) yields $\Pr(\exists i : Y[i] > \varepsilon/4) \le \beta$. On the other hand, if $Y[i] \le \varepsilon/4$
then processor i outputs “PASS.” Hence, if error $\le \varepsilon/8$ the protocol outputs
“PASS” on every processor with probability at least $1 - \beta$. □
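The counting-and-threshold structure of the generic tester can be sketched in Python. This is only an illustration under stated assumptions: `bernoulli_tester` is a hypothetical stand-in for one call to the simple tester on a single processor, and the run count and threshold are the values $N = \frac{32}{\varepsilon}\ln(2n/\beta)$ and $\varepsilon/4$ read off the proof sketch of Theorem 1.

```python
import math
import random


def num_runs(eps, beta, n):
    """Run count N = (32/eps) * ln(2n/beta), rounded up to an integer."""
    return math.ceil(32.0 / eps * math.log(2 * n / beta))


def agreement_tester(simple_tester, eps, beta, n):
    """Call the simple tester N times and compare the failure fraction to eps/4."""
    runs = num_runs(eps, beta, n)
    failures = sum(1 for _ in range(runs) if simple_tester())
    return "FAIL" if failures / runs > eps / 4 else "PASS"


def bernoulli_tester(fail_prob, rng):
    """Hypothetical simple tester: reports 'fail' with probability fail_prob."""
    return lambda: rng.random() < fail_prob


if __name__ == "__main__":
    rng = random.Random(1)
    print(num_runs(0.05, 0.01, 10))
    print(agreement_tester(bernoulli_tester(0.03, rng), 0.05, 0.01, 10))
```

The sketch models one processor only; in the protocol every nonfaulty processor keeps its own counter, and the $\ln(2n/\beta)$ term is what pays for the union bound over all n of them.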

3.3 Self-Corrector for Agreement: Reset Mode 

In this section we present a self-correcting protocol that, given an agreement
protocol P whose error probability (as estimated by the self-tester) is sufficiently
low, outputs the correct BA function for any input $x \in V$ by making calls to P.
For the design we follow the approach of [23]: to self-correct P on input x, the
source draws at random a value r from $V$, processors run $P(r)$ and $P(x -_V r)$, and
output $P(r) +_V P(x -_V r)$. (This is a simple case of the random self-reducibility
property.) Recall from Section 3.2 that the BA function admits two outputs depending
on whether the sender is faulty or not. Thus, it is necessary to guarantee
that either good outcome overwhelms the erroneous output.

Assume that we are given a BA protocol P such that $\mathrm{error}(BA, P, D_V, C) \le \varepsilon$
(e.g., $\varepsilon = 0.05$). The $\varepsilon$-self-corrector for BA is shown in Figure 3. The protocol
performs the two calls to P as described above, and then does a case analysis 
on the possible outcomes; this is repeated an adequate number of times. 
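The random self-reduction step can be illustrated with a small Python sketch. The assumptions are loud ones: the value domain is modeled as the additive group $Z_Q$ with a hypothetical modulus `Q`, the function being corrected is the identity (standing in for "every processor outputs the sender's value"), the faulty program `faulty_identity` is invented for the demonstration, and the case of a faulty sender (the default output) is ignored.

```python
import random
from collections import Counter

Q = 101  # hypothetical modulus; Z_Q stands in for the value domain


def faulty_identity(v):
    """Hypothetical faulty program: should return v but errs when v % 20 == 0."""
    return (v + 1) % Q if v % 20 == 0 else v


def self_correct(program, x, runs, rng):
    """Query the program at random r and at x - r, add the answers, take the majority."""
    outcomes = []
    for _ in range(runs):
        r = rng.randrange(Q)
        outcomes.append((program(r) + program((x - r) % Q)) % Q)
    return Counter(outcomes).most_common(1)[0][0]


if __name__ == "__main__":
    rng = random.Random(7)
    print(faulty_identity(40), self_correct(faulty_identity, 40, 101, rng))
```

Because r and x - r are each uniformly distributed, a program that errs on a small fraction of inputs answers both queries correctly in most runs, even when x itself is one of the inputs it gets wrong.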

We will need the following statement in order to establish the correctness of
the self-corrector (we instantiate for reasonable values of $\alpha$).

Proposition 1. Let $\alpha >$ |, and let $X_1, X_2, \ldots, X_N$ be independent identically-distributed
0/1-valued random variables such that $\Pr[X_i = 1] \ge \alpha$, $i = 1, 2, \ldots, N$.
Then



Pr

Protocol Agreement_Corrector(P, s, x, ε, β)

jY ^ Q(^ ln(n//3) ^ 

for j = 1, ..., N

{ s chooses r ∈_{D_V} V
  output_1 ← P(r)
  output_2 ← P(x −_V r)
  if (output_1 = ⊥ ∧ output_2 = ⊥)
      outcome_j ← ⊥

  elseif (output_1 ≠ ⊥ ∧ output_2 ≠ ⊥)
      outcome_j ← output_1 +_V output_2
else outcome j ^ 

} 

P[i] ^ most common outcome 1 ^ J < A, 
s.t. outcome j ^ 



Fig. 3. Self-corrector for Byzantine agreement; code for processor i 



The proof follows directly from Chernoff bounds. The “2/3” captures the fact 
that there are two correct outcomes of the protocol that are to overwhelm the 
erroneous one. 

Theorem 2. The protocol of Figure 3 is an e- self- correcting protocol for BA, 
for e < i. 

Proof (Sketch). By assumption, the probability that in a single run the outcome
of processor i is $P[i] \ne r$ (resp., $P[i] \ne x - r$), for all i, is at most $\varepsilon$. Thus, both calls
to P return a correct outcome with probability at least $1 - 2\varepsilon$. Letting $\alpha = 1 - 2\varepsilon$,
and adjusting the number of runs so that this happens for all processors, the
claim follows from Proposition 1. □



3.4 Self-Tester/Corrector for Agreement: Parallel Mode

We now consider the case where the network is not reset between calls, namely, 
where faults are persistent. The problem in this case is that if a protocol is prone 
to behave badly at processor i and processor i fails, then the bad behavior of 
the protocol is “shielded” by the processor failure. We need a mechanism to 
unshield the protocol behavior, so that the remaining correct processors have 
enough information to assess its quality. 

To this end we suggest the technique we call ID re-assignment. The goal of the
technique is to enable each and every remaining processor to “impersonate” every
other processor in the network. Roughly, the implementation goes as follows.
We do n parallel executions of the tester (resp. corrector) presented before,
where each execution represents a different cyclic shift of the processors’ IDs.
This implies that processor 1 runs the protocol as itself, as processor 2, and
so on, impersonating all other processors. Each processor collects a decision
value for every impersonated ID. A unanimous decision by all impersonated
IDs determines whether the protocol will pass the test (or output the corrected
value). Since each run may fail with negligible (inverse exponential) probability,
the combined protocol still succeeds with overwhelming probability.
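The cyclic-shift bookkeeping can be sketched as follows. This is a minimal illustration, assuming IDs are numbered 0..n-1 and that each execution yields a "PASS"/"FAIL" verdict per impersonated ID; the function names are hypothetical and the sketch does not run any actual protocol.

```python
def cyclic_assignments(n):
    """assignments[k][i] is the ID that processor i assumes in execution k."""
    return [[(i + k) % n for i in range(n)] for k in range(n)]


def impersonated_ids(n, processor):
    """Set of IDs that one processor assumes over the n shifted executions."""
    return {assignment[processor] for assignment in cyclic_assignments(n)}


def combined_verdict(verdicts):
    """Unanimity rule: pass only if every impersonated ID passed."""
    return "PASS" if all(v == "PASS" for v in verdicts) else "FAIL"


if __name__ == "__main__":
    print(cyclic_assignments(4))
    print(impersonated_ids(4, 0))
    print(combined_verdict(["PASS", "FAIL", "PASS", "PASS"]))
```

Each execution is a permutation of the IDs, so a surviving processor ends up playing every role, including the roles of processors that have failed.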

Remark 1. ID re-assignment can be augmented to a randomized ID re-assignment
via a shared random coin procedure. In environments with benign failures,
such as the ones considered in this section, a shared random coin protocol is
relatively easy to design (see, e.g., [5,13]). Using the shared coin procedure, processors
can draw random permutations of their IDs and execute an adequate number
of parallel runs under these re-assignments. This allows self-testing/correcting under
more powerful network adversaries, such as “biased” (asymmetric) or arbitrary
(in the latter case, the adversary does not have access to the shared coin).



4 Privacy-Preserving Self-Correctors 

The above already teaches us how the design of a self-testing/correcting pair has 
to take into consideration certain constraints. For example, it can only assume 
basic utility library routines to be correct. In the domain of distributed protocols 
other constraints may be posed on the problem and the addition of co-routines 
should not violate these requirements. This is particularly important in the case 
of secure protocols, where the design of self-testers and self-correctors should not 
create new security vulnerabilities. 

In this section we consider this aspect versus a passive adversary (i.e., the 
“private” or “gossip failure” setting). We assume the setting of secure distributed 
computing against an all-powerful adversary. A number of researchers, beginning 
with Ben-Or et al. and Chaum et al. [9,12], have designed distributed protocols 
for this setting. Informally, a protocol is said to be t-private if an adversary 
viewing the transcripts of any t parties learns nothing about the computation 
beyond what can be inferred from their t inputs together with the output. 

Following Definition 4, we say that a self-tester (resp., self-corrector) for P is
privacy-preserving if (resp., C^) is private (with overwhelming probability)
whenever P is private (with overwhelming probability).

Assume that the n-ary function f is computed over a finite field F, and can
be represented as a polynomial of degree d over F.^ Let $F^n$ denote the standard
n-dimensional vector space over F. In this section we assume that random
choices are made according to the uniform probability distribution on $F^n$.
We assume, by the library paradigm of [11], that we have reliable programs to
compute f locally. This is a reasonable assumption, as typically the task of a

^ Most of the protocols performing secure/private computations are — or can be tuned 
to be — over a finite field. Also, any function over a finite field can be represented as 
a polynomial.
distributed protocol is conceptually simple, and the whole point is to be able to
cope with the unpredictability of its environment. 

A generic self-tester for P is given in Figure 4. The protocol simply verifies 
the local computation of / with the distributed execution of P at a number 
of randomly chosen inputs. The correctness of the self-tester is clear. Here we 
concentrate on the privacy issue, so we leave the constants unspecified — but 
assume that they satisfy the Chernoff bounds. 

Our privacy-preserving self-corrector for P is given in Figure 5. It uses the 
“low-degree polynomial technique” of Beaver and Feigenbaum [7] to transform 
a function call at the desired (non-random) input into a collection of function 
calls at uniformly random inputs. The following series of lemmas establishes that
the self-corrector is privacy-preserving.



Protocol Private_Self_Tester(f, P, , ε, d)

N ←

S ← 0

for m = 1, ..., N (in parallel)

{ choose x^(m) ∈_U F^n
  compute locally f(x^(m))

  P(x^(m))

  if f ≠ P
      S ← S + 1

}

if S/N > cj Const
    “FAIL”
else “PASS”



Fig. 4. Generic self-tester for privacy; code for proc. i 



Lemma 2. If all oracle calls to P are correct, then the self-corrector of Figure 5
outputs $f(x)$.

Proof (Sketch). By the same construction as in a theorem in [7], it follows that
the value returned by $P(v^{(k)})$ corresponds to $g(k)$ for some univariate polynomial
g of degree at most d. By the same construction, we know that $g(0) = f(x)$.
□
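The interpolation argument can be checked numerically with a short Python sketch. Everything concrete here is an assumption made for illustration: the field is GF(97), the polynomial `f` is an invented degree-2 example in two variables, and the "program" queried is simply `f` itself, so the sketch shows only the algebra and none of the secret sharing.

```python
PRIME = 97  # hypothetical field F = GF(97)


def f(x):
    """Hypothetical polynomial of total degree 2 in two variables over GF(PRIME)."""
    return (3 * x[0] * x[1] + 5 * x[0] + 7) % PRIME


def lagrange_at_zero(points):
    """Interpolate the polynomial through the (k, y_k) pairs and evaluate it at 0."""
    total = 0
    for k, y in points:
        num, den = 1, 1
        for j, _ in points:
            if j != k:
                num = num * (-j) % PRIME
                den = den * (k - j) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        total = (total + y * num * pow(den, -1, PRIME)) % PRIME
    return total


def corrected_value(program, x, a, degree):
    """Query program at v^(k) = a*k + x for k = 1..degree+1 and recover g(0)."""
    points = []
    for k in range(1, degree + 2):
        v = [(ai * k + xi) % PRIME for ai, xi in zip(a, x)]
        points.append((k, program(v)))
    return lagrange_at_zero(points)


if __name__ == "__main__":
    x = [10, 20]
    print(f(x), corrected_value(f, x, [4, 9], 2))
```

Since $g(k) = f(ak + x)$ has degree at most d in k, the d + 1 query answers determine g, and $g(0) = f(x)$ regardless of which random vector a was drawn.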



Lemma 3. If at most one call to the oracle is not t-private, then the self-corrector
is t-private.

Protocol Privacy_Preserving_Self_Corrector(P, x, ε, β)

N ←

for m = 1, ..., N
{ choose a_i ∈_U F
  for k = 1, ..., d + 1

      v_i^(k) ← a_i k + x_i

  for k = 1, ..., d + 1 (in parallel)

      P(v^(k)) (yielding the share of w_k)
  interpret w_k as g(k) for some function g of degree d
  interpolate the function g at the points 1, 2, ..., d + 1
      using the shares of w_1, w_2, ..., w_{d+1}
  answer_m ← (share of) g(0)

}

P[i] ← most common answer in answer_m, 1 ≤ m ≤ N



Fig. 5. Privacy-preserving self-corrector; code for proc. i 

The proof follows from the pairwise independence of the $v^{(k)}$. Note the importance
of performing the interpolation of g from the shares of the $\{w_k\}_k$. Those
intermediate values might leak information about the (and hence the Xi) 
even if the protocol P is perfectly private, depending on the nature of the un- 
derlying function / being computed by P. 

Theorem 3. The protocol of Figure 5 is a privacy-preserving self-corrector
for any function f of degree d (with respect to the uniform distribution on $F^n$).

Proof (Sketch). Because error(/, P, f all {d -\- 1) outputs of P 
are correct with probability at least $1 - \varepsilon$ each time through the loop. If all
$(d + 1)$ outputs are correct we know by Lemma 2 that $\mathrm{answer}_m = f(x)$. Chernoff
bounds provide the desired confidence; t-privacy follows from Lemma 3. □

5 Conclusions 

This work has initiated the area of self-testing/correcting protocols, which gives a
methodology to add reliability on-line and in situ to protocols by adding simple
co-routines. In [11] and its extensions numerous numerical functions have been 
shown to have sequential self-testers and self-correctors. Applying the library 
paradigm one can assume that the correct sequential program is available, and 
use it to design a generic tester for the distributed version of the same numerical 
function (details deferred to the full version).

Abstract. The problem of performing t tasks in a distributed system
of p processors is studied. The tasks are assumed to be independent,
similar (each takes one step to be completed), and idempotent (can be
performed many times and concurrently). The processors communicate
by passing messages and each of them may fail. This problem is usually
called DO-ALL; it was introduced by Dwork, Halpern and Waarts.

The distributed setting considered in this paper is as follows: The system
is synchronous, the processors fail by stopping, and reliable multicast is
available. The occurrence of faults is modeled by an adversary who, for a
fixed constant $0 < c < 1$, has to choose at least $c \cdot p$ processors prior to
the start of the computation; the adversary must not fail the selected
processors but may fail any of the remaining processors at any time.

The main result is showing that there is a sharp difference between the 
expected performance of randomized algorithms versus the worst-case 
deterministic performance of algorithms solving the DO-ALL problem in 
such a setting. 

Performance is measured in terms of work and communication of algorithms.
Work is the total number of steps performed by all the processors
while they are operational, including idling. Communication is the total
number of point-to-point messages exchanged. Let effort be the sum of
work and communication. A randomized algorithm is developed which
has expected effort $O(t + p \cdot (1 + \log^* p - \log^*(p/t)))$, where $\log^*$ is
the number of iterations of the log function required to bring the value
of the function down to 1. For deterministic algorithms and their worst-case
behavior, a lower bound $\Omega(t + p \cdot \log t / \log\log t)$ on work holds, and it is
matched by the work performed by a simple algorithm.



1 Introduction 

The problem of performing t tasks in a distributed system of p processors is 
considered. The tasks are assumed to be independent, similar, and idempotent. 
The processors communicate by passing messages and are prone to failures. This 
problem is usually called DO-ALL. 

Review of prior work. The problem DO-ALL was introduced by Dwork, 
Halpern and Waarts [5], and subsequently studied in [3,4,6]. In these papers, 

* This research was supported by KBN contract 8 TllC 036 14. 



P. Jayanti (Ed.): DISC’99, LNCS 1693, pp. 284-296, 1999. 
© Springer- Verlag Berlin Heidelberg 1999 



Randomization Helps to Perform Tasks on Processors Prone to Failures 






if processors fail by stopping then the algorithms are required to perform all the
tasks even if only one processor survives. Dwork, Halpern and Waarts [5] consider
work defined as the total number of tasks performed counting multiplicities,
and a communication measure defined as the total number of messages sent. They
develop three protocols solving the DO-ALL problem, all of them work-optimal.
They also consider the effort complexity defined as the sum of work and communication.
One of their algorithms has effort complexity $O(t + p\sqrt{p})$. De Prisco,
Mayer and Yung [4] consider work defined as the available processor steps, that
is, each processor contributes a unit to the measure of work for each step when it
is operational. They present an algorithm with work $O(t + (f+1)p)$ and communication
$O((f+1)p)$, where f is the number of faults. In their algorithm, one processor
is designated as an active coordinator which allocates tasks and receives reports
of their completion. The coordinator may change over time. In [4] a lower bound
$\Omega(t + (f+1)p)$ on work is proved for any algorithm that uses the checkpointing
strategy. Another algorithm is developed by Galil, Mayer and Yung [6]. Paper [6]
presents a solution of Byzantine Agreement in the presence of crash failures
using the optimal linear number of messages. The algorithmic technique used in
the solution makes it possible to improve the message complexity of the algorithm in [4]
to $O(f p^{\varepsilon} + \min\{f+1, \log p\}\, p)$, for any positive $\varepsilon$, while achieving
available processor steps $O(t + (f+1)p)$. Chlebus, De Prisco and Shvartsman [3] present
two algorithms based on an aggressive coordination paradigm by which multiple
coordinators may be active as the result of failures. One algorithm is tolerant of
fail-stop failures; it has available processor steps $O((t + p \log p / \log\log p) \log f)$
and communication $O(t + p \log p / \log\log p + f p)$. The other algorithm is the only
one known in the literature that tolerates restarts. Its available processor steps
complexity is $O((t + p \log p + f) \cdot \min\{\log p, \log f\})$, and its message complexity
is $O(t + p \log p + f p)$.

Summary of contributions. We present a randomized algorithm solving the
DO-ALL problem and show that its expected performance is much better than the
worst-case behavior of any deterministic algorithm. The system is assumed to be
synchronous, and a reliable multicast is available. The processors fail by crashing.
The failures are controlled by an adversary who has to choose at least $c \cdot p$ processors
prior to the start of the computation, where $0 < c < 1$ is a fixed constant;
the adversary must not fail the selected processors but may fail any of the remaining
processors at arbitrary times. The performance of algorithms is measured by work, which
is the available processor steps, and communication, which is the total number
of point-to-point messages. Define effort to be work plus communication. Let
$\log^*$ be the number of iterations of the log function required to bring the
value of the function down to 1; formally $\log^{(1)} x = \log x$, $\log^{(k+1)} x = \log(\log^{(k)} x)$,
and $\log^* x = \min_k [\log^{(k)} x \le 1]$. Our randomized algorithm has expected
effort $O(t + p(1 + \log^* p - \log^*(p/t)))$. The efficiency of this randomized algorithm
is compared with the worst-case behavior of deterministic algorithms. A lower
bound $\Omega(t + p \log t / \log\log t)$ holds for the amount of work performed in the
worst case by any deterministic algorithm. For instance, if $t = \Theta(p)$ then the
randomized algorithm has expected work $O(p \cdot \log^* p)$, but any deterministic
algorithm requires work $\Omega(p \log p / \log\log p)$ in the worst case. Given a deterministic
algorithm, the adversary knows the future behavior of processors and
may compare all the possible scenarios of failures; in particular, if the adversary
wants to fail a specific processor at a specific step to prevent it from performing
a certain task at this step, then all the other processors (if any) assigned to
the same task need to be failed simultaneously as well. Randomization helps because
the adversary has no way to guess the pattern of assignments among tasks and
processors if it is created on-line in a random fashion. The lower bound on work
performed by any deterministic algorithm is matched by the worst-case work
performance of a simple algorithm, which has communication complexity
$O(t \log t / \log\log t)$.
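To get a feel for how slowly $\log^*$ grows compared with $\log p / \log\log p$, the definitions above can be evaluated directly. The sketch below is illustrative only: it assumes base-2 logarithms, drops all constants hidden by the asymptotic notation, and uses hypothetical function names.

```python
import math


def log_star(x):
    """Smallest k with log^(k) x <= 1, using base-2 logarithms (an assumption)."""
    k = 0
    while True:
        x = math.log2(x)
        k += 1
        if x <= 1:
            return k


def randomized_effort_bound(t, p):
    """t + p * (1 + log* p - log*(p/t)), with the hidden constant taken as 1."""
    return t + p * (1 + log_star(p) - log_star(p / t))


def deterministic_work_bound(t, p):
    """t + p * log t / log log t, with the hidden constant taken as 1 (needs t > 2)."""
    return t + p * math.log2(t) / math.log2(math.log2(t))


if __name__ == "__main__":
    p = 2 ** 64
    print(log_star(p))
    print(randomized_effort_bound(p, p) / p, deterministic_work_bound(p, p) / p)
```

With t = p the randomized expression reduces to p plus p times $\log^* p$, which is the $O(p \cdot \log^* p)$ figure quoted in the text.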

Other related work. In shared-memory models there is a natural problem
corresponding to DO-ALL. Given an array initialized with zeroes in the shared
memory, a task is to change the value stored at a location of the array. The
problem of performing all these tasks is called WRITE-ALL; it was introduced by
Kanellakis and Shvartsman [9,10]. Efficient solutions to this problem can be used
iteratively to convert arbitrary shared-memory computations into robust ones
resilient to processor failures. This was the subject of papers by Kedem, Palem
and Spirakis [13], Kedem, Palem, Rabin, and Raghunathan [11], Martel, Park,
and Subramonian [15], and Anderson and Woll [1]. A logarithmic lower bound
on time to solve WRITE-ALL deterministically was derived by Kedem, Palem,
Raghunathan and Spirakis [12], and on expected time of randomized executions
by Martel and Subramonian [14]. Kanellakis and Shvartsman [9] proved
a lower bound $\Omega(n \log n / \log\log n)$ on available processor steps in the model
with memory snapshots, and developed an algorithm with a matching performance,
where n denotes both the number of processors and the size of the array.
Buss, Kanellakis, Ragde and Shvartsman [2] proved a lower bound $\Omega(n \log n)$ on
available processor steps for any deterministic executions controlled by an adversary
that can cause both failures and restarts, and developed an algorithm of
a matching performance in the model with memory snapshots. For deterministic
computations, the most efficient algorithm performs work $O(n \log^2 n / \log\log n)$,
and if restarts are allowed then the best upper bound is $O(n^{1+\varepsilon})$ for arbitrary
positive $\varepsilon$; the best lower bound is $\Omega(n \log n)$ in both cases. For randomized
computations, expected work $O(n \log n)$ can be achieved even with restarts.

Contents of the paper. Section 2 presents details of the model and algorithmic
paradigms. In Section 3 we present a randomized algorithm RA, prove its
correctness and analyze its expected performance. Section 4 contains discussion
of the deterministic case, in particular a lower bound on work performed in the
worst case by any deterministic algorithm, and algorithm DA of a matching work
performance. Section 5 contains final remarks.

2 Model and Algorithmic Preliminaries 

Distributed setting. There are p synchronous processors, each having a unique
number in the range [1..p]. Processors communicate by passing messages. There
is no upper bound on the size of messages. A stage of computation consists of
three clock cycles called substages: receive (receiving messages), compute (performing
local computations), and send (multicasting a message). A message sent
at a clock cycle is received during the next clock cycle. A processor may in one
clock cycle multicast a message to any selected subset of processors, or receive
all the incoming messages. A local computation may include performing a single
task. The broadcast or multicast operations are reliable (cf. [8]), in the sense
that if a processor A is attempting to perform such an operation and fails in this
clock cycle then either no processor receives a message sent by A in this step or
the messages sent by A in this clock cycle are successfully delivered to all the
operational recipients (in the next step).

Tasks. There are t tasks to be performed; all the processors know the number t
and each processor may at any clock cycle perform any specific task. The tasks
are assumed to be independent, which means that any relative order of performing
them is allowed. The tasks are similar, which means that each of them takes one
step to be completed. Tasks are also idempotent, which means that they can be
performed many times and concurrently.

Model of failure. The processors are prone to stop-failures. If a processor
crashes then it stops performing any activity, and never restarts. The failures
are controlled by an adversary. There is a fixed constant $0 < c < 1$ such that
the adversary has to choose at least $c \cdot p$ processors prior to the start of the
computation; it must not fail the selected processors but may fail any of the
remaining ones at any time. The algorithms presented in this paper do not resort
to this constant c. Such an adversary has some features of the off-line adversary
(choosing a subset of processors prone to failures in advance) and some of the
adaptive adversary (failing the processors prone to failures on-line).

Performance measures. Two independent complexity measures are considered:
work and communication. Work means the available processor steps, as
defined by Kanellakis and Shvartsman [9,10]. It counts all the steps performed
by operational processors, including busy waiting; in other words, any operational
processor contributes one unit to work for each clock cycle when it is not
faulty. Communication is measured as the total number of point-to-point messages
sent over the network; in particular, a processor broadcasting a message
to all the other processors contributes $p - 1$ to the communication complexity.
Effort is the sum of work and communication.
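The three measures can be made concrete with a small accounting sketch. The data layout is an assumption chosen for illustration: `crash_times[i]` holds the clock cycle at which processor i crashes (or `None` if it never does), and each multicast is represented by its set of recipients.

```python
def work(crash_times, total_cycles):
    """Available processor steps: one unit per clock cycle while a processor is up."""
    return sum(total_cycles if c is None else min(c, total_cycles) for c in crash_times)


def communication(multicasts):
    """Point-to-point messages: a multicast to k recipients counts as k messages."""
    return sum(len(recipients) for recipients in multicasts)


def effort(crash_times, total_cycles, multicasts):
    """Effort is work plus communication."""
    return work(crash_times, total_cycles) + communication(multicasts)


if __name__ == "__main__":
    crash_times = [None, 3, None, 7]   # p = 4 processors, two of them crash
    broadcast = {1, 2, 3}              # processor 0 broadcasts: p - 1 = 3 messages
    print(work(crash_times, 10), communication([broadcast]))
    print(effort(crash_times, 10, [broadcast]))
```

Note that idle cycles of a live processor are charged to work, which is why pausing stages in the algorithm are not free.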

Coordinators and workers. In the course of an algorithm, the processors may
be categorized into coordinators and the remaining ones, referred to as workers.
A coordinator maintains this status till the end of computation, unless it fails,
but a processor with the current status of a worker may later become a coordinator.
The job of a coordinator is to repeatedly collect incoming messages about
the progress of work, combine them, and send the result back to all the processors
which are not known to have failed yet. Sending out such a message by a coordinator
also serves the purpose of confirming that this coordinator is still alive. A message
sent to every coordinator is called a report, and a message of a coordinator
sent to every processor is a summary.

Phases. The computation proceeds in phases comprised of four consecutive
stages. A phase may be performed in one of two modes: work or election. In the
work mode a phase consists of the following four stages: (1) performing a task
and sending reports, (2) receiving reports by the coordinators, combining them
into summaries and sending the summaries, (3) receiving summaries, updating
the local knowledge, and deciding on the mode of the next phase, (4) a pause.
Each processor, including the coordinators, sends a message about the tasks done
to every coordinator that either proclaimed itself a coordinator or confirmed its
existence in the previous phase. Each processor maintains a list of tasks that
it does not yet know to have been performed, and a list of processors that it does
not yet know to have failed. Each time a processor sends a report to the
coordinators or a coordinator sends a summary to all the processors, these
two whole lists are carried in the message. If a processor is sure that all the
tasks have been completed, then it terminates. If no coordinator is heard during
a phase then the remaining operational processors start an election process to
select new coordinators; this is done by switching to the election mode. During an
election phase, a worker may decide to be a coordinator; it then proclaims this by
broadcasting a suitable message to all the processors it expects to be operational.
More precisely, in the election mode a phase consists of the following four stages:
(1) referring to a randomly set local variable to check whether the processor has
qualified as a new coordinator, and if this is the case then sending out a proclamation, (2)
receiving proclamations, updating the list of coordinators, and sending reports
to new coordinators, (3) (performed only by the new coordinators) receiving
reports, then sending out summaries, (4) receiving summaries, and if having
heard from at least one coordinator then switching to the work mode.

Local knowledge. Each processor A maintains three lists: a list of outstanding
tasks Tasks_A, a list of operational processors Processors_A, and a list of coordinators
Coordinators_A. Lists Tasks_A and Processors_A are overestimates of the true
situation because if a processor fails then it takes a number of steps for this
information to reach every other processor, and similarly if a task is performed. At
the start of computation, each processor A obtains a list of all the tasks and a
list of all the processors, sorts the lists and initializes Tasks_A with the list of
all the tasks and Processors_A with the list of all the processors. Then A sets
Coordinators_A to be empty. A message M_A that processor A sends includes the
following data: the identification of A, the coordinator/worker status of A, and
lists Tasks_A and Processors_A.

3 Randomized Algorithm 

3.1 Algorithm RA 

Each processor A maintains a number of local variables. One of them is the
variable level_A, initialized to 1/2. It is increased by replacing the current value x
by 2 · x. Next is the variable coordination_priority_A; it is initialized to a
random integer value in the range [1..p], and never changed; the values of these
variables of distinct processors are independent of each other. All the processors
have unique names in the range [1..p]; they are used to assign tasks to processors
if there are more tasks than processors. The variable coordination_priority_A
does not need to be unique; it is used by processor A to decide if it becomes
a coordinator. If the inequality level_A > coordination_priority_A holds for
the first time during a certain phase then processor A is said to qualify for a
coordinator; after that the processor becomes a coordinator. The summary sent
by a coordinator in the first phase when it qualifies for a coordinator is called
the proclamation of the coordinator. Another variable is the Boolean variable mode_A,
which takes one of two values, work and election. The computation starts from
the election mode.
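The qualification rule can be traced with a few lines of Python. This is a sketch under assumptions: the level is doubled at the start of each election phase and then compared with the priority using the strict inequality as it appears in the text, and the helper names are hypothetical.

```python
def qualification_phase(priority):
    """1-based index of the election phase in which level first exceeds priority.

    The level starts at 1/2 and is doubled once per election phase before the test.
    """
    level, phase = 0.5, 0
    while True:
        level *= 2
        phase += 1
        if level > priority:
            return phase


def coordinators_by_phase(priorities):
    """Group processors by the election phase in which they would qualify."""
    phases = {}
    for processor, priority in priorities.items():
        phases.setdefault(qualification_phase(priority), []).append(processor)
    return phases


if __name__ == "__main__":
    print(coordinators_by_phase({1: 5, 2: 1, 3: 3}))
```

Because the level doubles, the range of priorities that qualify roughly doubles with each election phase, so the expected number of new coordinators grows geometrically until at least one non-faulty coordinator appears.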

After initialization of the variables and lists, all the processors repeat four- 
stage phases in a loop until all of them terminate. Phases are either election or 
work, according to the mode. The following are detailed descriptions of phases. 

Election phase of processor A of algorithm RA

Stage 1: Receive substage not used. In compute substage increase
level_A. Then check if the test to qualify for a coordinator is passed.
If this is the case then send proclamation in send substage to
every processor on list Processors_A.

Stage 2: In receive substage receive all the proclamations. In compute
substage, for each proclamation M_B received, add B to list
Coordinators_A. If this list is nonempty then in send substage send
report to every coordinator on it.

Stage 3: Let A be a new coordinator, otherwise pause in this stage. Receive
reports in receive substage. In compute substage update two
lists. On list Tasks_A: remove any item T such that it is not included
in some list Tasks_B contained in message M_B just received. On list
Processors_A: remove any item B such that no report of processor
B was received. In send substage send summary to all the
processors on list Processors_A.

Stage 4: In receive substage receive all the summaries. If at least one
summary is received then in compute substage update three lists. The
lists Tasks_A and Processors_A are updated similarly as in stage 3.
Updating of list Coordinators_A: for any item B on this list, if no
summary was received from it in the previous stage, then remove B from
both lists Coordinators_A and Processors_A. If list Coordinators_A
is nonempty then switch the mode to work. Send substage not used.
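The list update performed by a coordinator in Stage 3 can be sketched as below. The representation is an assumption made for illustration: `reports` maps each sender's ID to the list of outstanding tasks carried in its message, and the function name is hypothetical.

```python
def update_lists(tasks, processors, reports):
    """Apply the Stage 3 update rules to the coordinator's local lists.

    A task stays only if every received task list still contains it;
    a processor stays only if a report from it has arrived.
    """
    remaining_tasks = [t for t in tasks if all(t in r for r in reports.values())]
    remaining_processors = [b for b in processors if b in reports]
    return remaining_tasks, remaining_processors


if __name__ == "__main__":
    tasks = [1, 2, 3, 4]
    processors = [1, 2, 3]
    reports = {1: [1, 2, 4], 3: [2, 3, 4]}   # no report from processor 2
    print(update_lists(tasks, processors, reports))
```

A task missing from any one report has been performed by someone, so dropping it is safe; a silent processor is treated as failed.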

Work phase of processor A of algorithm RA

Stage 1: Receive substage not used. In compute substage perform a
task: if the size of list Tasks_A is at least as large as the size of list
Processors_A then perform task T such that the rank of T in Tasks_A
is equal to the rank of A in Processors_A, otherwise select a random
task T from Tasks_A and perform it. Remove T from Tasks_A. In send
substage send report to every processor on list Coordinators_A.

Stage 2: Let A be a coordinator, otherwise pause in this stage. In receive
substage receive all the reports. In compute substage: update
lists Tasks_A and Processors_A, similarly as in the election phases.
In send substage send the summary to all the processors on list
Processors_A.

Stage 3: In receive substage receive all the summaries. In compute substage
update lists Tasks_A, Processors_A and Coordinators_A similarly
as in election phases. If list Coordinators_A is empty then
switch the mode to election. If list Tasks_A is empty then terminate.
Send substage not used.

Stage 4: Pause for the whole stage.
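The task-selection rule of Stage 1 of the work phase is short enough to state in code. The sketch assumes both lists are kept sorted, that the processor's own ID `me` is on its processor list, and that ranks are positions in those sorted lists; the function name is hypothetical.

```python
import random


def select_task(tasks, processors, me, rng):
    """Pick the task to perform in Stage 1 of a work phase.

    With at least as many tasks as processors, use the rank of this processor
    to pick a distinct task; otherwise fall back to a uniformly random task.
    """
    if len(tasks) >= len(processors):
        return tasks[processors.index(me)]
    return rng.choice(tasks)


if __name__ == "__main__":
    rng = random.Random(0)
    print(select_task([10, 20, 30, 40], [2, 5, 7], 5, rng))
    print(select_task([10, 20], [2, 5, 7], 7, rng))
```

The rank-based branch avoids duplicated effort while tasks are plentiful; the random branch is where the adversary loses the ability to predict which processors will attempt a given task.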

The phases have been designed to facilitate exposition and simplify the correctness
proof. Tasks are not performed during election phases, to have task-oriented
work separated from coordinator-selection work. Messages sent by processors
always have the same form; they include all the information that may
sometimes be useful. Hence there is room for economizing on the size of messages;
for instance, the coordinator/worker part is not really needed.



3.2 Correctness of RA 

A phase is productive if it ends in a work mode of some processor. Let local view 
of processor A consists of the values of variables mode^ and levels, and lists 
Tasks^, Processors^ and Coordinators^. 

Lemma 1. In any execution of RA, the local views of any two operational 
processors are the same at the end of a productive phase. 

Proof. Induction on the number of phases. A phase is productive if at least 
one coordinator sent its summaries. They were received by all the processors, 
by the inductive assumption and the reliability of multicast. The summaries 
were based on the same set of reports, by the same argument, hence updates of 
local lists of tasks, processors and coordinators produced the same result. The 
value of variable mode_A depends on list Coordinators_A. It follows that all the 
processors simultaneously perform either work or election phases. Variable level 
is increased simultaneously by all the processors during election phases. □ 



Lemma 2. Algorithm RA terminates after at most t + p + 2 log p phases. 

Proof. The number of election phases is at most log(1 − c)p because the instances 
of variable level are simultaneously increased during each such phase, by 
Lemma 1. Each work phase that is not productive precedes an election phase, 
hence the number of such phases is not larger than the number of election phases. 
It is enough to count the productive phases. Observe that during such a phase 
each processor A either removes a task from Tasks_A or adds a processor to 
Coordinators_A. Namely, if it is a work phase, then each processor performs 
a task and removes it from its list of tasks; at the end of this phase all the 
processors have removed the task from their lists of tasks, by Lemma 1. If this is 
an election phase then new coordinators are added to Coordinators_A. It follows 
that the number of such phases is at most t + p. □ 

Theorem 1. Algorithm RA terminates with all the tasks performed. 

Proof. By Lemma 2, algorithm RA terminates. Processor A terminates if list 
Tasks_A is empty. Processor A has removed a task from this list either if it 
performed the task itself or if it received information in a summary that the 
task had been performed. □ 



3.3 Analysis of RA 

When processors select tasks to be performed at random, the behavior of 
algorithm RA can be modeled by the following process of clearing s urns with n 
balls: at the start there are s empty urns, a step begins with a random 
placement of n balls in the given set of urns, then the urns containing at least 
one ball are removed; such steps are iterated until no urn remains. 
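
The urn-clearing process can be simulated directly. The following Python sketch is an illustration only; the function name `clear_urns` and the assumption that each ball is placed independently and uniformly at random among the urns still present are ours.

```python
import random


def clear_urns(s, n, rng):
    """Simulate clearing s urns with n balls; return the number of steps."""
    remaining = s
    steps = 0
    while remaining > 0:
        # Throw n balls uniformly at random into the remaining urns and
        # record which urns received at least one ball.
        hit = {rng.randrange(remaining) for _ in range(n)}
        # Urns containing at least one ball are removed.
        remaining -= len(hit)
        steps += 1
    return steps
```

Since every step with n ≥ 1 removes at least one urn, the loop always terminates; running it for growing n = s gives an empirical feel for how slowly the number of steps increases.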

Lemma 3. The process of clearing s urns with n balls, for n > s, terminates 
within log* n — log* ^ + 0{\) steps with the probability at least 1 — , for 

sufficiently large n. 

Proof. If k balls are randomly placed in m urns then the expected number of 
empty urns is 

$$ m \cdot \left( \frac{m-1}{m} \right)^k = m \cdot \left( 1 - \frac{1}{m} \right)^k . $$
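
The closed form for the expected number of empty urns, m(1 − 1/m)^k, can be checked against exhaustive enumeration for small parameters. The sketch below is only a sanity check; the function names are hypothetical, and exact rational arithmetic is used to avoid rounding issues.

```python
from fractions import Fraction
from itertools import product


def expected_empty(m, k):
    """Closed form m * (1 - 1/m)^k for k balls thrown into m urns."""
    return m * Fraction(m - 1, m) ** k


def expected_empty_bruteforce(m, k):
    """Average the number of empty urns over all m^k equally likely placements."""
    total = 0
    for placement in product(range(m), repeat=k):
        total += m - len(set(placement))
    return Fraction(total, m ** k)
```

For example, with m = 3 urns and k = 2 balls both functions give 4/3.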



First consider the case s = n. Let $X_i$ be the random variable equal to the 
number of empty urns after step i. We may assume that the balls in a step are 
placed one by one. Let $X_{i,k}$ be the number of empty urns after the k-th ball 
has been put in a random urn during step i. 

Let $Y_{i,k} = \mathrm{E}[X_i \mid X_{i,1}, \ldots, X_{i,k}]$. Then the sequence 
$(Y_{i,k})_k$ is a martingale (cf. [7]). Adding one ball changes the number of 
empty urns by at most 1, hence $|Y_{i,k+1} - Y_{i,k}| \le 1$. This allows us to 
use the method of bounded differences (cf. [16]), and to estimate probabilities 
of deviations by the Azuma inequality, which takes the form 

$$ \mathrm{P}[X_i - \mathrm{E}X_i > x] \le \exp(-x^2/2n) , \qquad (1) $$

because $X_i = Y_{i,n}$. This inequality holds for any number of urns used in step i. 

Let = e^^^\ where e^^^ = e. We have EXi < n/e. Then by (1) the 

inequality P [Xi > ^ + ein] < e~^^^ holds for each sufficiently small ei > 0 and 
some corresponding > 0. Let event A\ hold iff Xi < ^ -h ein. Then 



E [X 2 I A\] < n{e ^ -h ei) exp(- 



n 

e • 



n{e 1 -h ei) 
with ei ^ 0 . Let event A2 hold iff X2 < . Then P [—A2 \ Ai] < e for 

a certain 62 > 0 . The case of X3 is similar to X2, namely 

E[Xs |^in^2] . 

Let As hold iff Xs < n/e^^\ We can estimate the probability similarly as before: 
[—As I Air\A2] < exp(— n/2(e^^^)^). This leads to a generalization: define Ak 
to hold iff Xk < nle^^\ and define 



a/c = P [—Ak I w4i n • • • n Ak-i] . 

Then the following inequalities hold: 



E [Xk I v4i n ■ ■ ■ n Ak-i] < g(fe_i)g(fe) 



and 



Partition the n-process into two stages. Let the first stage terminate if the num- 
ber of empty urns is less than n/\ogn. If k < log* x then < logx. We 
estimate the probability that the first stage terminates within log* n — 1 steps. 
Let u = log* n. The probability that all events Ai hold, for 3 < i = log* n, 
is at least 

$$ \prod_{i=3}^{u-1} (1 - a_i) = \exp\Bigl( \ln \prod_{i=3}^{u-1} (1 - a_i) \Bigr) = \exp\Bigl( \sum_{i=3}^{u-1} \ln(1 - a_i) \Bigr) \ge \exp\Bigl( -2 \sum_{i=3}^{u-1} a_i \Bigr) , $$

for $a_i \le 1/2$. Also 

ai < exp(— n/2(e^*^)^) < exp(— n/2 log^ n) , 

for 3 < i < u. Hence '^s<i<u ^ ^ ’ exp(— n /2 log^ n). We need to estimate 

^ [rii<2<iog* n number is at least 

(1 _ (1 - exp(-2 ■ log* n ■ = 1 - . 

This is an estimation of the probability that the first stage is completed within 
log* n — 1 steps. 

In stage two, the first step places n balls into n/log n urns. Let Z be 
a random variable equal to the number of empty urns after this step, then 

$$ \mathrm{E}Z \le \frac{n}{\log n} \cdot e^{-\log n} = O(1/\log n) . $$

From the Azuma inequality: 

$$ \mathrm{P}[Z - \mathrm{E}Z > d\sqrt{n}] \le \exp(-d^2/2) . $$
For d = with probability 0{e there are empty urns. Consider 

the next step of the second stage: the probability that a specific urn is empty is 
)^) = With probability there is an empty 

urn after that step. 

Thus the overall probability that after at most 1 + log* n steps there is no 
empty urn is at least 

l-Oflog*n-expf - 0(e-^) - 0(n3/4 . > 1 - , 

V V 21og n// 

for sufficiently large n. 

Finally consider the case s < n. We need to count the number of steps needed 
to have at most n/logn empty urns. During each of these steps we estimate 
the conditional expected number of the remaining urns by number of the form 
g(fc-i)g(fc) ; which is less than n/ \ogn for k = log* n — log* ^ + 0(1). The rest of 
the proof is the same as in the case n = s. □ 

Lemma 4. The expected work of algorithm RA is $O(t + p \cdot (1 + \log^* p - \log^*(p/t)))$. 

Proof. First consider the productive work phases. If during a phase the number 
of outstanding tasks is not smaller than the number of operational processors, 
then each task is performed by at most one processor, which is assigned to the 
task in a deterministic way. The total amount of work performed in this fashion 
is $O(t)$. Now consider the phases when processors select tasks randomly. The 
number of operational processors is at least cp during each step. The process 
of diminishing the outstanding tasks is modeled by the process of clearing $O(p)$ 
urns with $O(p)$ balls. Hence the work performed is 
$O(p(1 + \log^* p - \log^*(p/t)))$ by Lemma 3. 

Next we estimate the number of election phases, which also gives a bound 
on the number of unproductive work phases. Let X be the minimum value of 
level_A for all the processors A that never fail. There are at least cp such 
processors, and the following estimation holds: 

EX = y]p[x > i] < y](i - < ^g-c(i-i) ^ ^ 

i=i i=i ^ i=i 

Hence the expected number of election phases is $O(1)$ and the expected work 
during such phases and the unproductive work phases is $O(p)$. □ 



Lemma 5. The expected number of messages sent while RA is executed is 
$O(t + p \cdot (1 + \log^* p - \log^*(p/t)))$. 

Proof. Let X be the minimum value of variable level_A for all the processors A 
that never fail. Let Y be the number of processors B such that the inequality 
$level_B \le 2^j$ holds, where j is the integer satisfying $2^{j-1} < X \le 2^j$. 
Number E Y is the expected number of coordinators in the course of the algorithm. 
The expected number of phases is $O(1 + t/p + \log^* p - \log^*(p/t))$. The 
expected number of messages sent during a phase is $O(p \cdot \mathrm{E}Y)$. We 
estimate E Y as follows: 

p l+logp 

F,{Y) ='^F,{Y\X = i)P{X = i) < ^ 2^'P [2^-1 < X < 2^] 






i=i 



l+logp 



^ E E 2^exp(-c2^-i) <-^fce-'='' = 0(l) . 



1+logp 



/c=l 



i=i i=i 

Straightforward calculations complete the proof. 

Lemmas 4 and 5 imply: 

Theorem 2. The expected effort of algorithm RA is 
$O(t + p \cdot (1 + \log^* p - \log^*(p/t)))$. □ 



4 Deterministic Computations 

In this section we discuss the deterministic case, to compare it with the 
randomized one. To show that randomization really helps, it is sufficient to 
restrict our attention to work performance only. 

Consider the scenario in which all the processors have complete knowledge 
of the progress made, in terms of the tasks already completed. This is similar to 
computations on a PRAM with memory snapshots, solving the Write-All problem. 
Kanellakis and Shvartsman [9] proved a lower bound for such computations 
solving the Write-All problem (see also [10]). The adversary considered in [9,10] 
may fail all but one of the processors. Notice that in our situation the adversary 
is weaker, because it is prohibited from failing certain $\Omega(p)$ processors. 
However, a lower bound of a similar form can be proved in the model of this paper, 
by an adaptation of the arguments from [9,10]. More precisely, the following holds: 

Theorem 3. Any deterministic algorithm solving DO-ALL problem requires work 
Q{t +p • log iQg ^ ) i'^ the worst case. □ 

A full proof will be presented in the final version of this paper. 

For the sake of completeness, we present a deterministic algorithm of matching 
work performance. Let us call it DA; it is similar to RA. The main differences 
are as follows: (1) all the processors act as coordinators, and (2) tasks are 
performed in a load-balanced way, that is, processor A of rank i in list 
Processors_A performs the task whose rank is i modulo the length of Tasks_A. 
There are no election phases in DA, only work phases performed in a loop. A work 
phase consists of the first three stages of the work phase of RA, with 
Processors_A playing the role of Coordinators_A, and the variables needed for 
elections not used. The correctness of DA is proved similarly to the correctness 
of RA. 
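
The load-balanced assignment used by DA can be sketched as follows. This is an illustrative fragment, not the full algorithm: the function name `assign_tasks` is hypothetical, both lists are assumed to be ordered and identical at all processors, and ranks are taken to be 0-based list positions.

```python
def assign_tasks(tasks, processors):
    """Map each processor to the task it performs in one work phase of DA.

    The processor of rank i performs the task of rank i mod len(tasks).
    """
    return {proc: tasks[i % len(tasks)] for i, proc in enumerate(processors)}
```

For instance, with two tasks and three processors the first and third processors perform the same task, so every task is covered as evenly as possible.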



Theorem 4. The algorithm DA performs work 0(t + log^fg^^ ) the worst case. 
Proof. The method of analysis of algorithm AN for the case of failures without 
restarts, as presented in [3], can be applied here. More precisely, the work of 
algorithm DA corresponds to the work of algorithm AN during the attended 
phases. Details will be provided in the final version of this paper. □ 

By skipping certain communication steps in the initial part of algorithm, the 
number of messages of algorithm DA can be made equal to 0(tPp^ • ^ ) • 

5 Discussion 

We described the performance of algorithm RA in terms of the expected work and 
communication. If one would like to have bounds on work and communication 
holding with probability polynomially close to 1, then such bounds for RA are 
$O(t + p \log\log p)$ on work and $O(t + p \log p)$ on communication. 

Optimality of solving DO-ALL by a randomized algorithm in the setting of 
this paper is an open problem. We claim that the expected work of any such 
algorithm is $\Omega(p \log^* p)$ for t = p. 
Abstract. In this paper, we study the problem of efficiently scheduling 
a wide class of multithreaded computations, called strict; that is, 
computations in which all dependencies from a thread go to the thread’s 
ancestors in the computation tree. We present the first scheduling 
algorithm which applies to any strict multithreaded computation and is 
provably efficient in terms of execution time, space complexity and 
communication cost. The algorithm is distributed, randomized, works in an 
asynchronous way and follows the work-stealing paradigm. Our analysis 
applies to both shared-memory and distributed-memory parallel 
machines and generalizes the one presented in [5], which applies only to 
fully strict multithreaded computations; that is, computations in which 
all dependencies from a thread go to the thread’s parent. 



1 Introduction 

Dynamically growing multithreaded computations are nowadays quite common 
for parallel computers. Multithreaded models of parallel computation have 
typically been proposed as a general approach to model dynamic, unstructured 
parallelism and have been employed by several parallel programming languages. 
To specify parallelism, a thread can spawn child threads. Additionally, a thread 
may synchronize with some or all of its (direct or indirect) descendants by 
suspending its execution until a descendant reaches a specific point of 
computation. A multithreaded computation is identified with a directed acyclic 
graph, henceforth abbreviated as dag. The nodes of this dag represent 
instructions of threads, while there are three different kinds of edges: continue 
edges for organizing instructions into threads, spawn edges for representing 
spawning of threads from other threads, and dependency edges declaring data or 
synchronization dependencies among different threads (see, e.g., Figure 1). Spawn 
edges organize threads into a rooted tree, called the spawn tree. 

For the execution of a multithreaded computation on a parallel computer, 
one should specify which processor executes which threads and when each thread 
should be executed. Apparently, this is not a desirable task to be undertaken by 
a programmer. On the other hand, transferring the control of these scheduling 
decisions to the runtime system might not be wise, unless the system guarantees 
that it will make good scheduling decisions in order to execute the program 
efficiently. A good scheduling technique must ensure that enough threads remain 
active to keep the processors busy, while at the same time the number of 
concurrently active threads must stay within limits in order to control the 
(dynamic) memory needed. Moreover, in order to reduce the communication among 
processors, one should try to maintain related threads on the same processor. 
Apparently, designing a scheduler to achieve all of the above goals is not a 
trivial task. 

* This work has been supported in part by the European Union’s ESPRIT Long Term 
Research Project ALCOM-IT (contract # 20244). 

Two scheduling paradigms have been considered in the past, work-sharing 
and work-stealing. In work-sharing, overutilized processors try to migrate some 
threads to other (hopefully underutilized) processors. In contrast, in the 
work-stealing paradigm, underutilized processors “steal” work from other 
processors. The work-stealing paradigm dates back at least as far as Burton and 
Sleep’s research [10] on parallel execution of functional programs and Halstead’s 
implementation of Multilisp [15]. Since then a lot of work has been done in this 
direction (see e.g., [1,3,4,5,6,7,12]). 

Three important performance parameters of scheduling algorithms for 
multithreaded computations on parallel computers are the required space, the 
execution time, and the communication cost incurred during the course of an 
execution; the first is characterized by the amount of storage needed for an 
execution, the second is the total number of steps needed for executing all 
threads, while the last is the amount of communication incurred for keeping more 
than one processor busy and resolving any kind of dependencies that may exist 
between threads executing on different processors. 

An algorithm achieves linear speedup if its execution time with P processors 
is P times faster than the optimal execution time on a one-processor computer. 
Since a processor can execute only one instruction at each time step, an algorithm 
that achieves linear speedup is optimal in terms of execution time. An algorithm 
uses linear expansion of memory if its space requirements are not more than 
P times the space requirements for the execution of the same computation on 
a one-processor computer. There exist very simple and common multithreaded 
computations that require linear expansion of memory in order to achieve linear 
speedup (see e.g., [4, Section 2.3]). Hence, a scheduling algorithm that uses linear 
expansion of memory is arguably efficient in terms of space complexity. 

Say that a multithreaded computation is fully strict when all dependencies 
from a thread go to the thread’s parent. Apparently, the class of fully strict 
multithreaded computations is a rather restricted one. At the opposite extreme, a 
general multithreaded computation allows arbitrary dependencies between threads. 
There is, in addition, an important middle ground between fully strict and 
general multithreaded computations: those in which dependencies from any thread 
are directed to the thread’s ancestors in the spawn tree. Call these computations 
strict. Apparently, the class of strict multithreaded computations is much wider 
than the class of fully strict ones. Clearly, the class of strict computations 
encompasses all distributed multithreaded applications in which communication 
between threads and any of their ancestors is required. 

Blumofe and Leiserson [6] have proved that there exists no scheduling 
algorithm for general multithreaded computations that achieves both linear 
speedup and linear expansion of memory. They have presented a multithreaded 
computation for which no algorithm can achieve even a factor of two speedup 
unless its space requirements grow as a function of the serial execution 
time. However, they have proved that, by restricting to the class of strict 
multithreaded computations, there exist efficient execution schedules that both 
achieve linear speedup and use linear expansion of memory. Thus, these results 
imply that if one wants to find a scheduling algorithm with good execution time 
and memory complexity, restricting to strict multithreaded computations does 
the job. 

In a pioneering work, Blumofe and Leiserson [5] have considered the class of 
fully strict multithreaded computations and they have presented the first prov- 
ably good work-stealing scheduler (in terms of all three performance parameters) 
for such computations. However, the class of fully strict multithreaded compu- 
tations is very restricted, compared to the more general class of strict compu- 
tations. Thus, an interesting question left open by their work is the existence of 
a provably good scheduling algorithm for (general) strict multithreaded compu- 
tations. Blumofe and Leiserson [5, Section 7] have pointed out that generalizing 
their analysis to work for general strict computations is indeed both important 
and difficult. 

In this paper, we show that it is, in fact, possible to enjoy good performance 
properties for this much wider class of multithreaded computations. Our result 
significantly enhances the class of multithreaded computations that can be 
scheduled efficiently in terms of both execution time and memory complexity. 
More specifically, we present a distributed, asynchronous, work-stealing 
scheduling algorithm for (general) strict multithreaded computations and we prove 
good bounds on its performance. Our algorithm enjoys nice locality properties, 
while simultaneously it manages to schedule related threads together. Moreover, we 
prove that our algorithm is efficient in terms of execution time, as well as space 
complexity. Our analysis applies not only to shared memory computers, but 
also to distributed memory systems. More significantly, we provide bounds on 
the communication complexity of our algorithm for such systems. 

Our algorithm achieves expected execution time $O(T_1/P + hT_\infty)$, where $T_1$ 
is the optimal execution time with one processor, $T_\infty$ is the length of a 
path of maximum length in the instruction dag, and h is the dependency height of 
the computation; that is, h is the maximum “distance” in the spawn tree between 
any two threads that need to communicate during the course of the execution. 
Clearly, no algorithm can achieve execution time better than $T_1/P$ since each 
processor can execute only one task at a step. Moreover, no algorithm can achieve 
execution time less than $T_\infty$, since none of the instructions on this path 
can be executed in parallel with any other instruction on the path. Notice that our 
algorithm achieves linear speedup whenever $P < T_1/(hT_\infty)$, so that 
$hT_\infty < T_1/P$. 
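
To see how the two terms of the time bound interact, the following Python sketch evaluates the expression $T_1/P + hT_\infty$ with the hidden constant taken to be 1, and computes the largest number of processors for which the $hT_\infty$ term does not exceed $T_1/P$. The function names and the numerical values are assumptions made purely for illustration.

```python
def time_bound(t1, t_inf, h, p):
    """Evaluate T_1/P + h*T_inf, ignoring the constant hidden by the O-notation."""
    return t1 / p + h * t_inf


def max_processors_for_linear_speedup(t1, t_inf, h):
    """Largest integer P with h*T_inf <= T_1/P."""
    return t1 // (h * t_inf)


# Hypothetical computation: 10^6 tasks, critical path 100, dependency height 5.
p_max = max_processors_for_linear_speedup(10**6, 100, 5)
bound = time_bound(10**6, 100, 5, p_max)
```

For these numbers `p_max` is 2000 and the bound evaluates to 1000 steps, that is, at most twice the ideal $T_1/P$; beyond that point the $hT_\infty$ term dominates and adding processors no longer helps.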

The space complexity of our algorithm is $O(S_1 P)$, where $S_1$ is the optimal 
space complexity with one processor. Thus, our algorithm uses linear expansion 
of memory. Recall that an algorithm that uses linear expansion of memory is 
arguably efficient in terms of space complexity. 

The expected communication complexity of our algorithm is 
$O(PhT_\infty(1 + n_d)S_{max})$, where $S_{max}$ is the maximum size of storage 
needed by any thread of the computation and $n_d$ is the maximum number of 
dependency edges entering any thread. Wu and Kung have proved in [19] that there 
exist fully strict multithreaded computations such that any algorithm that 
achieves linear speedup incurs total communication $\Omega(PT_\infty S_{max})$. 
Thus, scheduling algorithms for fully strict multithreaded computations that 
achieve this amount of communication are optimal in terms of communication cost. 
Our communication bound diverges by a factor of h from this lower bound. However, 
we argue that this divergence is not unexpected, since h appears to be a measure 
of the degree of “strictness” of multithreaded computations. Since h = 1 for 
fully strict multithreaded computations, this divergence of our communication 
bound from the lower bound provided for fully strict multithreaded computations 
appears natural to us. 

We also prove that for any $\epsilon > 0$, with probability at least 
$1 - \epsilon$ the algorithm has execution time 
$O(T_1/P + hT_\infty + \log P + \log(1/\epsilon))$ and communication 
complexity $O(P(hT_\infty + \log(1/\epsilon))(1 + n_d)S_{max})$. 

Substantial research (see e.g., [1,16,18,20]) has been reported in the 
literature concerning the scheduling of multithreaded computations, though 
ignoring space requirements and communication costs. Burton shows in [9] how to 
limit space in certain parallel computations without causing deadlock. More 
recently, Burton [8] has developed and analyzed a scheduling algorithm with 
provably good time and space bounds. Blelloch et al. [2,3] have also recently 
developed and analyzed scheduling algorithms with provably good time and space 
bounds for languages with nested fine-grained parallelism (that is, languages that 
lead to series-parallel dags). All these algorithms are analyzed only for 
shared-memory machines and do not account for communication cost. In contrast, our 
analysis holds even for distributed-memory machines; more significantly, we provide 
bounds on the communication complexity of our algorithm for such machines. 

Fatourou and Spirakis considered in [12] the case of k-strict computations, 
that is, computations in which any dependency from a thread goes to one of 
the k ancestors of the thread in the activation tree. They present two scheduling 
algorithms for k-strict multithreaded computations. Their algorithms work in a 
different way than the algorithm presented in this paper and require knowledge 
of k. In contrast, the algorithm presented in this paper is based on new 
ideas, which employ timestamps, and maintains extra and more complicated 
data structures in order to keep track of related threads, that is, threads that 
communicate a lot and thus should be placed on the same processor. Proving 
particular properties of the structure of these data structures appears to be a 
major challenge of our analysis. These properties do not hold for the algorithms 
presented in [12], so that those algorithms are not appropriate for the more 
general case of strict computations. 

Recently, Arora et al. [1] have proved that the work-stealing algorithm 
presented in [5] can be analyzed assuming general multithreaded computations. 
However, their analysis provides only an execution time bound and no bounds 
for the space complexity. In contrast, our algorithm provides bounds on both 
execution time and space complexity, albeit for the less general class of strict 
multithreaded computations. Additionally, the analysis provided in [1] applies 
only to shared memory computers, while our analysis applies even to distributed 
memory machines; more significantly, we provide bounds on the communication 
complexity of our algorithm for such machines. 

The algorithm presented in [5] has been implemented in CILK [4] which 
is a C-based language for programming multithreaded computations. Another 
tool that employs randomized work-stealing techniques as its load balancing 
mechanism is VDS [11]. 

The rest of this paper is organized as follows. Section 2 includes definitions 
and some preliminary facts, while Section 3 presents our algorithm, exhibits 
fundamental properties maintained by the algorithm, and provides bounds for 
its space complexity, its execution time and its communication cost. Many of our 
proofs have been omitted in this extended abstract. They can be found in [13]. 

2 The Model 

Our definitions closely follow the ones presented in [5,12]. A multithreaded 
computation is modeled as a graph G, which is called the instruction graph of the 
computation. A multithreaded computation is composed of a set of threads, each 
of which contains unit-time, serially executed tasks. The nodes of the instruction 
graph represent tasks of the multithreaded computation. Tasks are connected to 
each other via continue edges that determine the order in which they are 
executed. During the course of a thread’s execution, the thread may create, or 
spawn, child threads. A thread can spawn as many children as it likes. If a 
task $\gamma$ of a thread $\Gamma_1$ spawns another thread $\Gamma_2$, a spawn edge 
begins from task $\gamma$ and ends at the first task of thread $\Gamma_2$ in the 
instruction graph. Spawn edges organize the threads into a rooted tree, which is 
called the spawn tree of the computation. Call the thread that corresponds to the 
root of this tree the root thread of the multithreaded computation. In addition to 
the continue and spawn edges, the instruction graph may also contain dependency 
edges. Dependency edges model data dependencies, e.g., producer-consumer 
dependencies, and allow threads to synchronize. If the execution of a thread 
arrives at a “consuming” task before the execution of the corresponding 
“producing” task, the execution of the consuming thread stalls. Once the producing 
task executes, the dependency is resolved and the consuming task is enabled to 
proceed with its execution. For any multithreaded computation G, we denote by 
$E_d(G)$ the set of the dependency edges of G. 
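
The model above can be made concrete with a small data structure. The following Python sketch is one possible representation, not one prescribed by the paper: the class name `Computation`, its methods, and the encoding of edges as `(kind, source task, destination task)` triples are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Computation:
    """Instruction graph of a multithreaded computation (illustrative only)."""
    thread_of: dict = field(default_factory=dict)  # task -> owning thread
    edges: list = field(default_factory=list)      # (kind, src task, dst task)

    def add_thread(self, thread, tasks):
        """Register a thread's tasks in order, linked by continue edges."""
        for task in tasks:
            self.thread_of[task] = thread
        for a, b in zip(tasks, tasks[1:]):
            self.edges.append(("continue", a, b))

    def add_edge(self, kind, src, dst):
        """Add a spawn edge or a dependency edge between two tasks."""
        assert kind in ("spawn", "dependency")
        self.edges.append((kind, src, dst))

    def parent(self):
        """Spawn tree as a map from each spawned thread to its parent."""
        return {self.thread_of[d]: self.thread_of[s]
                for kind, s, d in self.edges if kind == "spawn"}

    def dependency_edges(self):
        """The set E_d(G), as a list of (producing task, consuming task)."""
        return [(s, d) for kind, s, d in self.edges if kind == "dependency"]
```

A two-thread computation, in which the root spawns a child from its first task and later consumes a value produced by the child's last task, would be built with two `add_thread` calls followed by one spawn edge and one dependency edge.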
Fig. 1. A multithreaded computation. 



Figure 1 presents a multithreaded computation. A thread in this graph is 
represented by a block of circles, where each circle represents a particular task 
of the thread. The horizontal edges in Figure 1 are continue edges. Spawn edges 
are represented by downward-pointing edges. Thus, thread lb is the parent of 
threads A and T4. Dependency edges are represented by curved arrows between 
tasks of different threads. For example, task 74 of thread Fq is a consuming task. 
It needs a data item that is produced by task 718 of thread Is. Thus, task 74 
can not be executed until after task 718. 

Clearly, the execution of a multithreaded computation is possible only if 
dependency edges do not produce cycles in the instruction graph. Thus, this 
graph must be a directed acyclic graph, or dag. We denote by $n_d$ the maximum 
number of dependency edges entering any specific thread. A thread dies when 
all its tasks have been executed; a dead thread is one that has died. A spawned 
thread that is not dead is alive. 

A multithreaded computation is said to be strict if all dependencies from 
one thread go to ancestors of the thread in the spawn tree. A fully strict 
multithreaded computation is one in which all dependencies from a thread go to 
the thread’s parent. Multithreaded computations that contain any other kind of 
dependencies are called non-strict or general. Our example computation of 
Figure 1 is a strict computation. If we remove the dashed dependency edges from 
the multithreaded computation of Figure 1, it becomes a fully strict 
multithreaded computation. 

For any thread $\Gamma$ of a multithreaded computation, we define its height to be 
the distance of thread $\Gamma$ from the root thread in the spawn tree. Clearly, the 
height of the root thread is 0. We can partition the spawn tree into levels, where 
level 0 contains only the root thread, level 1 contains all threads of height 1, and 
generally level i contains all threads of height i. For any thread $\Gamma$, we 
denote by $level(\Gamma)$ the level at which $\Gamma$ is located in the spawn tree 
of G. Consider any thread $\Gamma$ located at level j in the spawn tree. For any 
integer $i \le j$, we define the i-ancestor of $\Gamma$ to be the ancestor of the 
thread located at level j − i in the spawn tree. For any integer i > 0, the 
i-descendant of $\Gamma$ is defined in the natural way. 

For any multithreaded computation G and any two threads $\Gamma, \Gamma' \in G$ 
connected by a dependency edge $e \in E_d(G)$, we define the dependency height of 
edge e, denoted by $h_e$, to be $h_e = |level(\Gamma) - level(\Gamma')|$. The 
dependency height of G, denoted $h(G)$, is defined to be 
$h(G) = \max_{e \in E_d(G)} \{h_e\}$. In the rest of this paper, we use the 
notation h instead of $h(G)$, that is, we omit G whenever it is clear from the 
context. 
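
The definitions of level, i-ancestor and dependency height translate directly into code. The sketch below assumes the spawn tree is given as a map from each non-root thread to its parent and that dependency edges are given as pairs of threads; the function names, this input format, and the convention that a computation without dependency edges has height 0 are assumptions made for the example.

```python
def levels(parent, root):
    """Return the level of every thread; `parent` maps child -> parent."""
    level = {root: 0}

    def depth(thread):
        if thread not in level:
            level[thread] = depth(parent[thread]) + 1
        return level[thread]

    for thread in parent:
        depth(thread)
    return level


def i_ancestor(parent, thread, i):
    """Walk i steps towards the root of the spawn tree."""
    for _ in range(i):
        thread = parent[thread]
    return thread


def dependency_height(parent, root, dep_edges):
    """h(G): maximum of |level(a) - level(b)| over dependency edges (a, b)."""
    level = levels(parent, root)
    return max((abs(level[a] - level[b]) for a, b in dep_edges), default=0)
```

In a fully strict computation every dependency edge joins a thread to its parent, so this function returns 1 whenever at least one dependency edge exists.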

An execution schedule for a multithreaded computation G, denoted 
$\mathcal{X}(G, P)$, determines which processors of the parallel computer execute 
which instructions at each step. A valid execution schedule must satisfy all 
constraints imposed by the continue, spawn and dependency edges of the 
instruction dag of a multithreaded computation. At any given step t of an 
execution schedule, a task is ready if all of its predecessors in the instruction 
dag have been executed. Only ready tasks may be executed. We say that an execution 
schedule maintains the busy-leaves property if at each time step, every leaf 
thread has a processor working on it. 

When a task spawns a thread, it allocates an activation frame, that is, a block
of memory, for use by the newly spawned thread. We denote by $S_{\max}(G)$ the
maximum size of the activation frame of any thread of a multithreaded
computation G. We denote by $S_1(G)$ the minimum amount of space required for the
execution of G on a one-processor machine. The work of a computation G is the
total number of tasks in the computation. The dag depth of a task in G is the
length of the longest path of the instruction graph that terminates at the task.
The dag depth of G is the maximum dag depth of any of its tasks. We denote
by $T_1(G)$ the work of the computation, since a one-processor computer can only
execute one task at any time step. We denote by $T_\infty(G)$ the dag depth of G,
since even with arbitrarily many processors each instruction on a path must
execute serially. For any multithreaded computation G, a P-processor (P > 1)
execution schedule for G may incur some communication when running
in a distributed way. Apart from the communication needed for the
“distribution” of threads among the different processors, communication may
also be required for resolving dependencies among threads residing on
different processors. We denote by $C(\mathcal{X}(G, P))$ the total communication
required for the execution of the multithreaded computation G by the execution
schedule $\mathcal{X}(G, P)$.
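The two measures just defined can be computed directly from an instruction dag. The following is a minimal sketch, not part of the paper: the function name `work_and_depth`, the integer task labels, and the edge-list representation are all assumptions made for illustration, and path length is counted in tasks (so that it matches the number of serial time steps).

```python
from collections import defaultdict, deque


def work_and_depth(num_tasks, edges):
    """Return (T_1, T_inf) for an instruction dag.

    Tasks are labelled 0..num_tasks-1 (an assumption of this sketch) and
    `edges` is a list of (u, v) pairs meaning u must execute before v.
    The depth of a task counts the tasks on the longest path ending at it.
    """
    successors = defaultdict(list)
    indegree = [0] * num_tasks
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    depth = [1] * num_tasks
    queue = deque(t for t in range(num_tasks) if indegree[t] == 0)
    visited = 0
    while queue:  # process tasks in topological order
        u = queue.popleft()
        visited += 1
        for v in successors[u]:
            depth[v] = max(depth[v], depth[u] + 1)
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)

    if visited != num_tasks:
        raise ValueError("instruction graph contains a cycle")
    return num_tasks, max(depth, default=0)
```

For a diamond-shaped dag with edges (0,1), (0,2), (1,3), (2,3), the work is 4 and the dag depth is 3, since tasks 1 and 2 can run in the same step on two processors.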

A scheduling algorithm for a multithreaded computation decides which
processors execute which tasks at every time step; that is, for any multithreaded
computation G and any integer P > 0, a scheduling algorithm Alg takes G and P
as input and outputs a schedule $\mathcal{X}_{Alg}(G, P)$. Since more than one thread may be
simultaneously ready in a single processor, any algorithm should provide a
mechanism for scheduling threads within a single processor. Call this mechanism
the internal scheduler. In order to keep more than one processor working, a
scheduling algorithm must dynamically distribute work among processors. Thus,
a basic component of any scheduling algorithm is the external scheduler, whose
job is to schedule threads across different processors. A work-stealing external
scheduler works as follows. When a processor runs out of work, it becomes a
thief and steals work from a victim processor. The choice of the victim
processor may be made either deterministically or in a randomized
way, yielding either deterministic or randomized algorithms, respectively. An
algorithm should also provide a mechanism for deciding what actions are taken
when a dependency is resolved. Call this mechanism the dependency resolver.

In this paper, we consider only distributed scheduling algorithms, in which
a copy of the above three mechanisms runs on each processor independently.
If any of the three components of a scheduling algorithm works in a randomized
way, the algorithm produces a probability distribution over all execution
schedules $\mathcal{X}(G, P)$. In this work, we concentrate on scheduling algorithms with
randomized, work-stealing external schedulers. For each such algorithm Alg, we
denote by $T_{Alg}(G, P)$, $S_{Alg}(G, P)$ and $C_{Alg}(G, P)$ the random variables denoting
the execution time, the total space required (the space complexity), and
the communication incurred (the communication complexity), respectively, during
the P-processor execution of a multithreaded computation G by Alg.

The work of Alg on a multithreaded computation G is defined to be the
number of tasks (instructions) that Alg executes during the execution of G on
a P-processor parallel computer. Notice that the work of any scheduling
algorithm, whether randomized or not, is always equal to $T_1(G)$. We define the
stealing time of Alg to be the random variable expressing the number of steal
attempts that occur during the P-processor execution of Alg on the multithreaded
computation G. The waiting time of Alg is the random variable expressing the
total time that processors wait due to contention caused by stealing at the ready
lists of processors.

We continue by describing fundamental analytical tools that prove
useful for analysing the complexity of work-stealing scheduling algorithms.

For any multithreaded computation G, the augmented instruction dag G'
contains the original graph as a subgraph, but it has some extra edges, which
are called critical edges. For every set of tasks $\gamma_i$, $\gamma_j$ and $\gamma_m$ such that
$(\gamma_i, \gamma_j)$ is a spawn edge and $(\gamma_i, \gamma_m)$ is a continue edge, the critical edge
$(\gamma_m, \gamma_j)$ is also an edge of G'.

For any time step t during the execution of a multithreaded computation G
on a P-processor parallel computer, an unexecuted task $\gamma$ is said to be critical
at time step t if all tasks $\gamma'$ such that there exists a path from $\gamma'$ to $\gamma$ in G'
have been executed by time step t. A critical path of tasks in G' is a maximal
path in G'. We say that a critical path $U(G) = (\gamma_1, \ldots, \gamma_L)$ occurs during the
execution of a multithreaded computation G according to some valid execution
schedule $\mathcal{X}(G, P)$ if one and only one of its tasks is critical at each time step of
the execution.

A round of steal attempts during the execution of some computation G is
a set of at least 6hP but fewer than 6hP + P consecutive steal attempts such
that if a steal attempt initiated at time step t occurs in a particular round, then
all other steal attempts initiated at time step t are also in the same round. All
steal attempts that occur during an execution can be partitioned into rounds
as follows. The first round contains all steal attempts initiated at time steps
$1, 2, \ldots, t_1$, where $t_1$ is the earliest time step such that at least 6hP steal attempts
were initiated at or before $t_1$. In general, if the ith round ends at time step $t_i$,
then the (i + 1)st round begins at time step $t_i + 1$ and ends at the earliest time
step $t_{i+1} > t_i + 1$ such that at least 6hP steal attempts were initiated at time
steps between $t_i + 1$ and $t_{i+1}$, inclusive. We say that a given round of steal
attempts occurs while some instruction $\gamma$ is critical if all of the steal attempts
that comprise the round are initiated at time steps when $\gamma$ is critical.
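The partition into rounds described above can be sketched as a single pass over the steal attempts grouped by time step. This is an illustrative sketch only: the function name `partition_into_rounds` and the representation of each steal attempt by its initiation time step are assumptions, and the upper limit of 6hP + P attempts per round holds only if at most P attempts are initiated per time step, which the code does not check.

```python
from itertools import groupby


def partition_into_rounds(steal_times, h, P):
    """Split steal attempts into rounds of at least 6*h*P attempts each.

    `steal_times` holds one entry per steal attempt: the time step at which
    it was initiated. All attempts sharing a time step stay in one round.
    Returns (rounds, leftover), where `leftover` collects trailing attempts
    that do not yet form a complete round.
    """
    threshold = 6 * h * P
    rounds, current = [], []
    for _, same_step in groupby(sorted(steal_times)):
        current.extend(same_step)          # never split a time step
        if len(current) >= threshold:      # earliest step reaching 6hP
            rounds.append(current)
            current = []
    return rounds, current
```

With h = 1 and P = 2 the threshold is 12 attempts, so two attempts per step over steps 1 to 13 yield one round covering steps 1 to 6, a second covering steps 7 to 12, and the two attempts of step 13 left over.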

For any multithreaded computation G, a delay sequence, denoted by $\mathcal{D}(G)$,
is a 3-tuple $(U(G), R, \Pi)$ satisfying the following conditions:
(1) $U(G) = (\gamma_1, \gamma_2, \ldots, \gamma_L)$ is a critical path; (2) R is a positive integer;
(3) $\Pi = (\pi_1, \pi'_1, \pi_2, \pi'_2, \ldots, \pi_L, \pi'_L)$ is a partition of the integer R
(that is, $R = \sum_{i \in [L]} (\pi_i + \pi'_i)$) such that $\pi'_i \in \{0, 1\}$ for each $i \in [L]$.
We define the ith group of rounds to be the $\pi_i$ consecutive rounds starting
after the $r_i$th round, where $r_i = \sum_{j \in [i-1]} (\pi_j + \pi'_j)$.
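As a small worked instance of this definition (the numbers are chosen purely for illustration, and $r_i$ is read here as the total number of rounds in all earlier pieces of the partition, which is an interpretation of the definition rather than a quotation), take L = 2 and $\Pi = (\pi_1, \pi'_1, \pi_2, \pi'_2) = (3, 1, 2, 0)$:

\[
R = (3 + 1) + (2 + 0) = 6, \qquad r_1 = 0, \qquad r_2 = \pi_1 + \pi'_1 = 4 .
\]

Under this reading the first group consists of rounds 1 to 3 and the second group of rounds 5 and 6, while round 4 belongs to neither group.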

Assume that Alg is any work-stealing scheduling algorithm and consider any
execution schedule $\mathcal{X}_{Alg}(G, P)$ produced by Alg for the execution of G on a
P-processor parallel computer. A delay sequence $\mathcal{D}(G) = (U(G), R, \Pi)$ such that
$U(G) = (\gamma_1, \ldots, \gamma_L)$ and $\Pi = (\pi_1, \pi'_1, \ldots, \pi_L, \pi'_L)$ is said to occur during the
execution of G by some execution schedule $\mathcal{X}_{Alg}(G, P)$ if the critical path U(G)
occurs during the execution and, for each $i \in [L]$, all $\pi_i$ steal attempt rounds in
the ith group occur while task $\gamma_i$ is critical.

3 A Randomized Distributed Scheduling Algorithm 

3.1 Description 

The algorithm is online, distributed and works in an asynchronous, randomized,
work-stealing fashion. Each processor maintains a data structure of ready threads
with two endpoints, top and bottom, which is called the ready list. The ready list
is always sorted according to the height of the threads it contains. Threads
of greater height occupy positions in the ready list closer to the bottom end.
We further associate a timestamp with each thread. When a thread is spawned,
it receives a timestamp equal to the value of the local clock of the processor in
whose ready list it is placed.

Some of the elements of the ready list of each processor are linked to one
another so that they form a doubly linked list, called the enabling list.
We will explain which elements of the ready list participate in the construction of
the enabling list below. The enabling list has two endpoints, start and end, and
it is always sorted according to the timestamps of the threads it contains. Threads
with larger timestamps reside in positions closer to the end of the enabling list. Each
processor p maintains a pointer, called uptr_p, which points to an appropriate
element of p’s ready list. We will explain how each processor maintains pointer
uptr below.

Algorithm SRWS is a distributed algorithm; thus, each one of the processors
runs its own internal scheduler, external scheduler and dependency resolver.
However, these components operate in the same way on every processor. The
algorithm starts with all ready lists empty, except for the ready list of the
processor where the root thread is initiated. All enabling lists are initially empty,
and all uptr pointers are initially null.

The internal scheduler of some processor p pushes and pops threads in its
processor’s ready list from the bottom, as if it were a stack. More specifically,
the internal scheduler of p removes the bottommost thread of its processor’s ready
list and starts work on it. If this bottommost thread belongs to the enabling list,
the internal scheduler updates the appropriate pointers of the enabling list. If
pointer uptr_p points to this thread, uptr_p is updated to null. The internal
scheduler works on this bottommost thread until it spawns, stalls, dies or enables
a stalled thread, in which case it acts according to the following rules: (1)
If a thread $\Gamma_i$ spawns a child thread $\Gamma_j$, then $\Gamma_i$ is placed on the bottom of the
ready list, and the internal scheduler continues work on $\Gamma_j$. If pointer uptr_p was
null before the spawn, it is updated to point to $\Gamma_i$. (2) If a thread $\Gamma_i$ enables a
stalled thread $\Gamma_j$, the dependency resolver runs. (3) If a thread $\Gamma_i$ stalls or dies,
its internal scheduler checks its ready list. If the list contains any threads, then
the internal scheduler removes and begins work on the bottommost thread. If the
bottommost thread belongs to the enabling list, the internal scheduler updates the
appropriate pointers of the enabling list. If pointer uptr_p points to the bottommost
thread, uptr_p is updated to null. If the ready list is empty, the external scheduler
runs and tries to obtain work from other processors. In case a thread enables
another thread and dies, the dependency resolver runs first, before the actions
for dying.

The dependency resolver of each processor works as follows. If a thread $\Gamma_i$
enables a stalled thread $\Gamma_j$, the now ready thread $\Gamma_j$ is placed in the proper
position, according to its priority, in the ready list of $\Gamma_i$’s processor. The
timestamp of thread $\Gamma_j$ takes the value, at the current time step, of the local
clock of the processor in whose ready list $\Gamma_i$ resides. Moreover, $\Gamma_j$ participates in the
construction of the enabling list (that is, it becomes one of its elements).

The external scheduler of any processor works as a thief and tries to steal
work from a victim processor, which is chosen uniformly at random. The
external scheduler queries the ready list of the victim processor^ and, if it is
non-empty, steals some thread from it. To choose which thread
to steal, the thief processor works as follows. The timestamp of the first thread in
the enabling list is compared to the timestamp of the thread pointed to by pointer
uptr, and the thread with the smaller timestamp of the two is stolen. If pointer
start is null, the thread pointed to by uptr is stolen. If pointer uptr is null,
the thread pointed to by start is stolen. If the thread pointed to by uptr is stolen,
pointer uptr is updated to point to the next element towards the bottommost
thread of the victim p’s ready list. If there is no such element, it takes the value null. If
the ready list of the victim processor is empty, the thief processor selects another
victim processor uniformly at random, and this procedure is repeated
until a victim processor with a non-empty ready list is found. When the external
scheduler of a processor succeeds in obtaining work, control passes to the internal
scheduler of the processor again.

^ In message-passing systems, a steal attempt on the ready list of a victim processor
can be implemented by sending a message to this processor. The external scheduler
of each processor is then responsible for processing messages of this kind.
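The thief's two decisions, which victim to probe and which thread to take, can be sketched as follows. This is a hedged illustration rather than the paper's implementation: the names `Thread`, `choose_thread_to_steal` and `pick_victim` are hypothetical, ties between equal timestamps are broken in favour of the enabling list (the text does not say), the thief is assumed never to pick itself, and the retry loop is capped by a `max_attempts` parameter that the algorithm itself does not have.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thread:
    name: str
    timestamp: int


def choose_thread_to_steal(start: Optional[Thread],
                           uptr: Optional[Thread]) -> Optional[Thread]:
    """Pick between the head of the enabling list and the uptr thread.

    `start` is the first thread of the victim's enabling list and `uptr`
    the thread its uptr pointer refers to; either may be None (null).
    """
    if start is None:
        return uptr
    if uptr is None:
        return start
    # Assumption: on equal timestamps prefer the enabling list.
    return start if start.timestamp <= uptr.timestamp else uptr


def pick_victim(thief, ready_lists, rng, max_attempts=1000):
    """Probe victims uniformly at random until one has a non-empty list.

    `ready_lists` maps processor ids to lists of threads. Returns the
    victim id, or None if no victim was found within `max_attempts`.
    """
    others = [p for p in ready_lists if p != thief]
    if not others:
        return None
    for _ in range(max_attempts):
        victim = rng.choice(others)
        if ready_lists[victim]:
            return victim
    return None
```

A thief would call `pick_victim` and then apply `choose_thread_to_steal` to the victim's `start` and `uptr` threads; updating `uptr` after the steal is left out of this sketch.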



3.2 Properties 

For each integer n > 1, denote [n] = {1, ..., n}. Our first proposition presents
important properties of pointer uptr. Roughly speaking, Proposition 1 asserts
that the threads comprising the enabling list of some processor p occupy the
topmost positions in its ready list; moreover, pointer uptr_p points to the topmost
element in p’s ready list that does not belong to the enabling list.

Proposition 1. Consider any strict multithreaded computation G and let P > 1
be any integer. Assume that an arbitrary processor $p \in [P]$ is working on a thread
at some given time step during the course of an execution of G according to some
execution schedule $\mathcal{X}_{SRWS}(G, P)$. Let $\Gamma_0$ be the thread that p is working on, let
n be the number of threads in p’s ready list and let $\Gamma_1, \Gamma_2, \ldots, \Gamma_n$ denote the
threads in p’s ready list ordered from bottom to top, so that $\Gamma_1$ is bottommost and
$\Gamma_n$ is topmost. If pointer uptr is not null and points to thread $\Gamma_m$, the threads in
p’s ready list satisfy the following properties:

(1) for all $i = 1, \ldots, m$: (a) thread $\Gamma_{i-1}$ is a child of thread $\Gamma_i$ in the spawn
tree; (b) thread $\Gamma_{i-1}$ has a larger timestamp than thread $\Gamma_i$; (c) thread $\Gamma_i$ has
not been worked on since it spawned thread $\Gamma_{i-1}$, so that thread $\Gamma_i$ is not an
element of the enabling list;

(2) if n > m, for all $i = m + 1, \ldots, n$, thread $\Gamma_i$ is an element of the
enabling list.

The next proposition establishes that a thread can be stolen from the ready
list of some processor p only after all its ancestor threads of smaller timestamp
in p’s ready list have been stolen.

Proposition 2. Let G be any multithreaded computation and let P > 1 be any
integer. Assume that an arbitrary processor $p \in [P]$ is working on a thread at
some given time step t during the course of an execution of G by some execution
schedule $\mathcal{X}_{SRWS}(G, P)$. Let $\Gamma$, $\Gamma'$ be threads in p’s ready list such that thread $\Gamma$ is
an ancestor of thread $\Gamma'$ and $\Gamma$ has a smaller timestamp than $\Gamma'$. Then, thread
$\Gamma'$ cannot be stolen from p’s ready list unless thread $\Gamma$ is stolen.

The following proposition presents fundamental properties of the structure of
the ready list of any processor. These properties are very important for proving
the performance bounds of our algorithm.

Proposition 3. Consider any strict multithreaded computation G and let P > 1
be any integer. Assume that an arbitrary processor $p \in [P]$ is working on a thread
at some given time step during the course of an execution of G according to some
execution schedule $\mathcal{X}_{SRWS}(G, P)$. Let $\Gamma_0$ be the thread that p is working on, let
n be the number of threads in p’s ready list and let $\Gamma_1, \Gamma_2, \ldots, \Gamma_n$ denote the
threads in p’s ready list ordered from bottom to top, so that $\Gamma_1$ is bottommost and
$\Gamma_n$ is topmost. If n > h + 1, then the threads in p’s ready list satisfy the following
properties: (1) for all $i = 1, \ldots, n - h$, thread $\Gamma_{i-1}$ is a child of thread $\Gamma_i$ in the
spawn tree; (2) for all $i = 1, \ldots, n - h - 1$, thread $\Gamma_i$ has not been worked on
since it spawned $\Gamma_{i-1}$; (3) if $i \in \{h + 1, \ldots, n\}$, for all $l = 0, \ldots, i - h - 1$, thread
$\Gamma_l$ has a larger timestamp than thread $\Gamma_i$.

Sketch of Proof. The proof is by induction on execution time. For the basis case,
where $t = t_0$, all claims hold vacuously, since the only ready thread is the root
thread. For proving the induction step we study several cases on the action taken
by the task executed at some processor p one time step before the considered
time step. The most difficult case to consider is when the task enables a stalled
thread. We use Propositions 1 and 2 to prove a collection of statements, the most
important of which are the following: (1) the enabled thread is placed in one of
the h topmost positions of p’s ready list; (2) for any thread $\Gamma$ that has been
placed in p’s ready list at some time step t for the last time, no l-descendant of
$\Gamma$, where l > h, exists in p’s ready list at time step t; moreover, (3) processor p
does not commence work stealing after time step t, while simultaneously (4) no
such descendant can be placed in p’s ready list by enabling. Thus,
all these descendants should be spawned in p’s ready list. We then prove that, due
to particular properties of the processors’ external schedulers and dependency
resolvers, as well as due to the way timestamps and pointers uptr are updated,
no such thread can be stolen from p’s ready list. We use the above facts to prove
the stated claims.

We continue by proving important properties of critical tasks. We start by
proving that during the execution of a multithreaded computation by algorithm
SRWS, a critical task residing in the ready list of some processor p is either an
element of p’s enabling list or pointed to by pointer uptr_p.

Proposition 4. Consider a strict multithreaded computation G and let P > 1 be
any integer. At every time step during the course of an execution of G according
to some execution schedule $\mathcal{X}_{SRWS}(G, P)$, each critical task must be the ready
task of a thread that either is an element of the enabling list of some processor
p or is pointed to by pointer uptr_p.

The following proposition states that during the execution of any strict
multithreaded computation by Algorithm SRWS on a parallel computer with P
processors, any critical task executes after at most (2h + 1) steal requests have
been serviced on the ready list of the processor on which the thread containing
this task resides.

Proposition 5. Consider any strict multithreaded computation G and let P > 1
be any integer. Assume that $\gamma$ is the critical task of a thread $\Gamma$ residing in
the ready list of processor $p \in [P]$ at some time step during the course of an
execution of G according to some execution schedule $\mathcal{X}_{SRWS}(G, P)$. Then, after
at most (2h + 1) steal requests have been serviced on processor p, task $\gamma$ executes.

Sketch of Proof. We prove that there exist at most h ancestors and at most
h descendants with larger timestamp than the thread to which $\gamma$ belongs, in one
processor’s ready list. Moreover, we prove that after 2h steal attempts have
been serviced by processor p, all these threads have been stolen and either $\gamma$ has
already been executed or it is to be stolen next, as needed.

We next prove that during the execution of any strict multithreaded compu- 
tation by Algorithm SRWS on a P-processor parallel computer, the probability 
that r > 1 rounds of steal attempts occur while any particular task is critical is 
exponentially decreasing with hr. 

Proposition 6. Consider any strict multithreaded computation G and let P > 1
be any integer. Then, for any task $\gamma$ and any integer r > 1, the probability that r
rounds occur during the course of an execution of G according to some execution
schedule $\mathcal{X}_{SRWS}(G, P)$ while $\gamma$ is critical is at most .

We finally bound the probability that a particular delay sequence occurs 
during the execution of a strict multithreaded computation by Algorithm SRWS. 

Proposition 7. Consider any multithreaded computation G and let P > 1 be
any integer. Assume that $\mathcal{D}(G) = (U(G), R, \Pi)$ is an arbitrary delay sequence
such that U(G) is a critical path of length L. Then, the following holds for
the probability that $\mathcal{D}(G)$ occurs during the execution of G according to some
execution schedule $\mathcal{X}_{SRWS}(G, P)$: $\Pr[\mathcal{D}(G) \text{ occurs}] \le$ .

3.3 Complexity 

In this section, we present upper bounds on the performance of Algorithm SRWS. 

Theorem 1. Consider any strict multithreaded computation G and let P > 1 be
any integer. Then, (1) $S_{SRWS}(G, P) \le S_1 P$; (2) for any $\epsilon > 0$, with probability at
least $1 - \epsilon$, the following holds: $T_{SRWS}(G, P) \in O(T_1/P + hT_\infty + \log P + \log(1/\epsilon))$;
moreover, $E(T_{SRWS}(G, P)) \in O(T_1/P + hT_\infty)$; (3) for any $\epsilon > 0$, with probability
at least $1 - \epsilon$, $C_{SRWS}(G, P) \in O(P(hT_\infty + \log(1/\epsilon))(1 + n_d)S_{\max})$; moreover,
$E(C_{SRWS}(G, P)) \in O(PhT_\infty(1 + n_d)S_{\max})$.

Sketch of Proof. We prove that all execution schedules produced by SRWS maintain
the busy-leaves property. It has been proved [4] that all execution schedules
that maintain the busy-leaves property use only linear expansion of memory. In
order to bound the execution time of our algorithm, we bound separately its
work, its stealing time and its waiting time. Then, we add up these three quantities
and divide by P. Clearly, the work of SRWS on G is $T_1(G)$. For bounding the
stealing time, we prove that during the execution of any multithreaded
computation, if a large number of steal attempts are initiated, a delay sequence occurs.
However, Proposition 7 implies that the probability that a delay sequence occurs
decreases exponentially with the number of rounds of steal attempts. Thus,
with high probability, a large number of steal attempts does not occur. We use
the bound derived for SRWS’s stealing time to bound its communication
complexity. In order to bound the waiting time of SRWS, we use a combinatorial balls
and bins game introduced and analyzed in [5].

4 Conclusion 

We have presented a provably good randomized, work-stealing algorithm for 
strict multithreaded computations. We have analyzed the performance of our 
algorithm in terms not only of execution time, but also of space and communi- 
cation complexity, and we have proved that our algorithm is arguably efficient in 
terms of all these parameters. Our analysis generalizes the one presented in [5]
and applies to both shared-memory and message-passing systems. Thus, our
work answers one of the major open problems raised in [5], namely, whether
there exists any provably efficient scheduling algorithm for the general case of
strict multithreaded computations.

Abstract. This paper describes the ability of asynchronous shared-memory
distributed systems to solve the consensus problem in a wait-free
manner if processes are permitted to perform transactions on the shared 
memory in a single atomic action. It will be shown that transactional 
memory is often extremely powerful, even if weak types of shared objects 
are used and the transactions are short. Suppose T is a type of shared 
object. For any positive integer m, the transactional type trans(T, m)
allows processes to perform up to m accesses to a collection of objects of 
type T in a transaction. The transaction may also include internal process 
actions that do not affect the shared memory. For any non-trivial type T, 
trans(T, m) can solve consensus among processes. A stronger

lower bound of Q(2"^) is given for a large class of objects that includes 
all non-trivial read-modify-write types T. If the type T is equipped with
operations that allow processes to read the state of the object without 
altering the state, then trans(T, 2) is capable of solving consensus among 
any number of processes. This paper also gives a consensus algorithm 
for processes using trans(n-consensus, m) and a consensus
algorithm for any number of processes that uses trans(test&set, 3).



1 Introduction 

It is often difficult to design fault-tolerant algorithms for asynchronous dis- 
tributed systems and prove them correct. This difficulty arises from uncertainty 
about the behaviour of processes: they run at arbitrarily varying speeds and are 
subject to failures. One way to make the job of algorithm designers easier is to 
allow them to specify that certain groups of operations by one process are to 
be performed without interruptions by other processes [2,3,6,7,12,15]. Here, a 
general model of transactional shared memory will be considered. A transaction 
is a programmer-specified block of an algorithm that the scheduler is required 
to treat as a single atomic action. Typically, distributed systems do not pro- 
vide transactions as primitive operations, so it would be desirable to implement 
transactions in software from the more basic primitives provided. In this paper, 
it is shown that giving the programmer the ability to use transactions greatly 
increases the power of a shared-memory distributed system, where the power is 
measured by the system’s ability to solve the consensus problem in a wait-free 
way. In particular, this means that one cannot hope, in general, to implement 
wait-free transactions in software from the corresponding primitive operations.

This paper considers asynchronous shared-memory systems. A scheduler is
free to interleave the steps of processes in an arbitrary way, and processes are 
subject to halting failures. All algorithms are required to be wait-free [4]: non- 
faulty processes must correctly complete their executions even if other processes 
fail or run at varying speeds. Processes communicate by accessing shared data 
structures called objects. Read/write registers, queues, and compare&swap 
objects are examples of different types of shared objects. A type of shared object 
can be specified formally as an I/O automaton [11]. The object has a state, a 
set of permissible operations, and a set of possible responses to operations. An 
operation updates the state of the object and returns a response to the process 
that invoked the operation, in accordance with a transition function that is part 
of the object specification. All objects considered in this paper are deterministic. 

For any object type T and any positive integer m, one can define the
transactional type trans(T, m), where m is an upper bound on the length of a
transaction. It consists of a collection of objects of type T, called base objects, indexed
by the natural numbers. Although a potentially infinite number of base objects
are permitted, the number of base objects actually used by the algorithms in
this paper is polynomial in the number of processes. The type T is called the
base type of the transactional type. An operation on the transactional object
(called a transaction) can be thought of as a block of code where any execution
of the code performs at most m operations on the base objects and performs
no other shared-memory operations. More formally, a transaction is a collection
of m (computable) functions $f_1, \ldots, f_m$, where $f_i$ maps tuples of $i - 1$ responses
from objects of type T either to “nil” or to an operation on an object of type T
and a natural number. Thus, $f_i(r_1, \ldots, r_{i-1})$ gives the operation to be performed
and the index of the base object to be accessed during the ith shared-memory
access of the transaction when the responses received from the first $i - 1$
shared-memory accesses in the transaction are $r_1, \ldots, r_{i-1}$. If, during some execution
of the transaction, the function $f_i$ evaluates to “nil”, this indicates that the
transaction should terminate after the first $i - 1$ shared-memory accesses.
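The formal definition of a transaction as functions $f_1, \ldots, f_m$ can be made concrete with a short simulation. The sketch below is an assumption-laden illustration, not the paper's construction: the class names `Register` and `TransactionalObject`, the tuple encoding of operations, and the use of a lock to stand in for the scheduler's atomicity guarantee are all choices made here, and Python's `None` plays the role of “nil”.

```python
import threading


class Register:
    """Hypothetical base type used only for this example: a read/write register."""

    def __init__(self, value=None):
        self.value = value

    def apply(self, op):
        kind, *args = op
        if kind == "read":
            return self.value
        if kind == "write":
            self.value = args[0]
            return "ack"
        raise ValueError(f"unknown operation {kind!r}")


class TransactionalObject:
    """Simulates trans(T, m): base objects of one type, indexed by naturals."""

    def __init__(self, make_base, m):
        self.make_base = make_base   # factory for base objects of type T
        self.m = m                   # upper bound on transaction length
        self.base = {}               # base objects, created on first access
        self.lock = threading.Lock() # stands in for atomic execution

    def run(self, fs):
        """Run a transaction given as functions f_1..f_k with k <= m.

        Each f_i receives the tuple of the first i-1 responses and returns
        either None ("nil") or a pair (operation, base object index).
        """
        if len(fs) > self.m:
            raise ValueError("transaction longer than m")
        responses = []
        with self.lock:
            for f in fs:
                step = f(tuple(responses))
                if step is None:      # "nil": terminate the transaction
                    break
                op, index = step
                if index not in self.base:
                    self.base[index] = self.make_base()
                responses.append(self.base[index].apply(op))
        return tuple(responses)
```

For example, a transaction that reads base object 0 and writes 7 to base object 1 only if the read returned the initial value can be written as `[lambda r: (("read",), 0), lambda r: (("write", 7), 1) if r[0] is None else None]`; the second access depends on the first response, which is exactly the adaptivity that the functions $f_i$ capture.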

The consensus problem has been a very useful tool for comparing the power 
of different shared-memory systems. In the consensus problem, each process 
begins with an input value and the processes must all output the same value. 
This common output value must be the input value of some process. Herlihy [4] 
showed that objects that solve the n-process consensus problem can be used, 
with registers, to implement any other object type in a system of n processes. 
Thus, the ability of a shared object type to solve the wait-free consensus problem 
is an important measure of the type’s power to implement other types of shared 
data structures and to solve problems. This result led to the idea of classifying 
object types according to their consensus numbers [4,8]. The consensus number
of an object type T, denoted cons(T), is the maximum number of processes that
can solve consensus using objects of type T and registers, or infinity if no such
maximum exists. In the latter case, the object type is called universal, since it
can be used to implement any object in a system with any number of processes. 
This paper studies how much the consensus number of a type T increases when 
processes are permitted to perform transactions instead of individual operations.

1.1 Related Results

Afek, Merritt and Taubenfeld [1] defined another type of object, called a multi- 
object, where processes may perform a number of basic operations as a single 
atomic action. If T is an object type and m is a positive integer, the multi-object 
multi(T, m) consists of a collection of base objects of type T, indexed by the nat- 
ural numbers. ^ An operation on the multi-object is a set of up to m operations 
on the base objects. As is the case for transactional objects, a multi-object has 
a potentially infinite number of base objects, but algorithms designed for multi- 
objects typically use only a polynomial number of base objects. In contrast to 
the adaptive nature of transactional memory, the m accesses of a multi-object 
operation must be specified in advance. Thus none of the accesses to base ob- 
jects may depend on the results from other accesses within the same operation 
on the multi-object. In addition, the m operations that make up an operation 
on the multi-object must be applied to distinct base objects. Although the latter 
restriction is not a part of the definition of transactional object types, all of the 
algorithms in this paper do perform their operations within a transaction on 
distinct base objects. 

Afek, Merritt and Taubenfeld [1] determined cons(multi(T, m)) for several 
base types T. Ruppert [14] proved a lower bound on the consensus number of 
multi(T,m) that applies to any type T with cons(T) > 2, and a stronger bound 
in the case of readable base types T. These results are summarized in the fourth 
column of Table 1. Since multi(T, m) is a restricted form of trans(T, m), any lower
bound on the consensus number of multi(T, m) is automatically a lower bound on
the consensus number of trans(T, m). Jayanti and Khanna [9] considered a related 
kind of multi-object, where no bound was set on the number of base objects that 
could be accessed in an atomic action. They showed that the consensus number 
of such a multi-object is always either 1, 2 or infinity. 

^ The type multi(T,m) has previously been denoted in the literature.

Table 1. Summary of results

A number of researchers have studied transactional objects for particular
base types. Most of the research has focussed on implementing versions of
transactional multi-objects from universal object types. One type of universal
primitive object that has been considered is the compare&swap object. It is
equipped with the operation compare&swap(old, new), which updates the state
of the object to new if and only if the current state of the object is old. Another
type is the LL/SC object, which is equipped with two operations: the load-linked
(LL) operation reads the value stored in the object, and the store-conditional(v)
(SC) operation updates it to a new value v. However, an SC operation performed
by a process succeeds if and only if the object has not been updated since
the last LL by that process. Israeli and Rappoport [7] described a variety of
implementations of versions of LL/SC and compare&swap objects where the SC
and compare&swap operations are generalized to access several objects in a single
atomic action. Shavit and Touitou described how to use LL/SC objects for a
non-blocking implementation of atomic transactions on read/write registers.
(A non-blocking implementation [5] has a weaker notion of fault-tolerance than
wait-freedom: there can never be an infinite execution in which no operation
is completed.) Moir [12] gave a similar wait-free implementation. Attiya and
Dagan [3] discussed a non-blocking implementation of operations that can access
two base objects in a single action, using LL/SC objects as their primitive. They
also showed that LL and SC operations can be used to solve a problem more
efficiently if processes may perform operations on more than one base object in
a single atomic action.
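The semantics of the two universal primitives described above can be sketched as sequential state machines. This is only an illustration under stated assumptions: the class and method names are invented here, each method call is taken to be one atomic step (the code adds no synchronization of its own), `compare_and_swap` is assumed to return the previous state (the text does not specify a response), and processes are identified by an explicit `pid` argument.

```python
class CompareAndSwapObject:
    """compare&swap(old, new): set the state to new iff it currently equals old."""

    def __init__(self, state=None):
        self.state = state

    def compare_and_swap(self, old, new):
        previous = self.state
        if previous == old:
            self.state = new
        return previous  # assumption: the response is the prior state


class LLSCObject:
    """LL reads the value; SC(v) succeeds iff no update since the caller's last LL."""

    def __init__(self, value=None):
        self.value = value
        self.linked = set()  # processes whose last LL is still valid

    def load_linked(self, pid):
        self.linked.add(pid)
        return self.value

    def store_conditional(self, pid, v):
        if pid not in self.linked:
            return False         # the object was updated since pid's last LL
        self.value = v
        self.linked.clear()      # an update invalidates every outstanding LL
        return True
```

If processes p and q both perform LL and p then performs a successful SC, a subsequent SC by q fails, because the object has been updated since q's last LL.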

1.2 New Results 

An object type is called trivial if it can be simulated without using shared mem- 
ory at all. For example, if the only operation permitted is a read operation, 
which simply returns the current state of the object, the object is trivial, since 
all read operations can be simulated by returning the initial state of the object 
as the response. An object type is called readable if it is possible to read the 
state of the object without altering the state. The read operation need not read 
the entire state in a single action: instead, it might be possible to read the state 
piece by piece, as in an array of registers, for example. Section 2 contains a 
proof that, for any non-trivial readable type T, trans(T, 2) is universal. Thus, 
in the case of readable objects, the transactional model is far more powerful 
than the multi-object model: a multi(register, m) object can only solve consensus
among 2m − 2 processes [4], and there are readable types whose consensus
numbers increase only by a factor of O(m) in a multi-object setting [14].

Section 3 studies a specific base type called n-consensus. The n-consensus
object is equipped with an operation propose(v), which returns the first value
proposed to the object. The object may be accessed at most n times and has
consensus number n. Afek, Merritt and Taubenfeld showed that the consensus
number of the multi-object multi(n-consensus, m) is $\Theta(n\sqrt{m})$. This result does not
extend to transactional objects: Sect. 3 shows that cons(trans(n-consensus, m))
is at least $n^{m-1}$.

Section 4 shows that cons(trans(T, m)) ≥ $2^{m-1}$ for a very large class of
base types T, which includes all non-trivial read-modify-write (RMW) types [10].
Many commonly studied objects including compare&swap objects, test&set objects
and fetch&add objects are RMW types. Section 4 concludes by considering
the base type test&set, which is perhaps the simplest kind of RMW object. A
test&set object has two states, 0 and 1. Its single operation, test&set, sets the
state to 1, and returns the old value of the state. This type has consensus number
two [4]. In fact, even multi(test&set, m) has consensus number two, for
any m [1]. However, trans(test&set, 3) is universal.

Section 5 gives a general lower bound of $2^{\lfloor (m-1)/2 \rfloor}$ on cons(trans(T, m)) for
any non-trivial base type T.

The results about consensus numbers of transactional objects are summa- 
rized in the last column of Table 1. The type is a readable version of the n- 
consensus type, and the toggle type is a simple three-state readable object [14]. 
The ∨ operator used in the table takes two object types as operands. The type
$T_1 \lor T_2$ describes an object that behaves either as an object of type $T_1$ or as an
object of type $T_2$, depending on its initial state. Using $T_1 \lor T_2$ as the base type
in a transactional setting effectively allows transactions that can access objects 
of both types. Thus, there is no real loss of generality in studying transactional 
objects that use a single base type instead of a collection of different base types. 

2 Readable Base Objects 

An object is readable if processes may read the object’s state without altering 
it. It is not necessary that processes be able to read the entire state in a single 
operation; instead they may be able to read it “piecewise”. Formally, a readable
object O has a state set that is a Cartesian product $Q = \prod_{k \in \Gamma} Q_k$, where $\Gamma$ is
an index set and $Q_k$ is a set for each $k \in \Gamma$. The sets $\Gamma$ and $Q_k$ need not be
finite. For each $k \in \Gamma$, processes may execute the operation read(O, k), which
returns component k of the current state of O without changing the state of O. In
addition to the read operations, the object may be equipped with an arbitrary 
set of other operations. These other operations are called update operations. 
It is assumed that each update operation can change only a finite number of 
components of the state. 

Any object type with a read operation that returns the entire state of the
object is readable; in this case, $|\Gamma| = 1$. An array of registers whose elements
can be read, copied or swapped atomically is another example of a readable type. 
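As a small illustration of a readable type with piecewise reads, the register array mentioned above might be modelled as follows. The class and method names are hypothetical; the point is only that each read returns a single component and leaves the state unchanged, while each update changes a finite number of components.

```python
class RegisterArray:
    """Readable type whose state is the tuple of cell values (illustrative sketch)."""

    def __init__(self, values):
        self.cells = list(values)

    def read(self, k):
        # read(O, k): return component k of the state without changing the state.
        return self.cells[k]

    def copy(self, src, dst):
        # Update operation that changes at most one component.
        self.cells[dst] = self.cells[src]

    def swap(self, i, j):
        # Update operation that changes at most two components.
        self.cells[i], self.cells[j] = self.cells[j], self.cells[i]
```

Reading every index in turn recovers the whole state, which is the sense in which the state can be read piece by piece.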

For readable objects, it is possible to give a combinatorial characterization 
of the object types that are capable of solving consensus among n processes. A 
readable object is called n-universal [13] if a set of n processes can be partitioned 
into two non-empty teams and a single operation can be assigned to each process 
so that for any group of processes, if each process performs its own operation on 
an appropriately initialized object X of type T, then each process in the group
could determine which team accessed X first, provided that it could see the final
state of X. A more formal definition is given below.

Definition 1. A readable type T is n-universal if there exist

• a state $q_0 \in Q$,

• a partition of the set of processes $\{P_1, \ldots, P_n\}$ into two non-empty teams A
and B, and

• an update operation $op_i$, for $1 \le i \le n$,

such that

for $1 \le j \le n$, $R_{A,j} \cap R_{B,j} = \emptyset$,

where $R_{A,j}$ is the set of pairs (r, q) for which there exist distinct process
indices $i_1, \ldots, i_\alpha$ including j with $P_{i_1} \in A$ such that if $P_{i_1}, \ldots, P_{i_\alpha}$ each
perform their operations (in that order) on an object of type T that is initially in
state $q_0$, $P_j$ gets the result r, and the object ends in state q. The set $R_{B,j}$ is
defined similarly.

The property of being n-universal characterizes the readable object types 
that have consensus number at least n. 

Theorem 2. [13] A readable type can be used with registers to solve consensus
among n processes if and only if it is n-universal.

The universality of non-trivial readable types may now be proved. Recall 
that an object is trivial if it can be simulated without using shared memory. 

Theorem 3. For any non-trivial readable type T, cons(trans(T, 2)) = ∞.

Proof. Since T is non-trivial, there must be some update operation op that 
changes the state of the object from some state q to some other state r. 

First it is shown that trans(T, m) is, itself, a readable type. It is possible to 
read each of the base objects (possibly in a piecewise manner), so it is possible to 
read any component of the transactional object’s state without altering the state. 
In addition, a transaction will satisfy the technical requirement of updating only 
a finite number of components of the state of the transactional object, since it 
accesses at most m base objects, and updates a finite number of components of 
the state of each of them. 

By Theorem 2, it suffices to show that trans(T,m) is n-universal (as specified
in Definition 1) for every natural number n. The initial state $q_0$ of the
transactional object has every base object in state q. Partition a collection of n
processes into two teams $A = \{P_1\}$ and $B = \{P_2, \ldots, P_n\}$. Let the base objects of
the transactional object be denoted by $O_1, O_2, \ldots$. The operation $op_1$ assigned
to process $P_1$ is simply an application of the update operation op to $O_1$. For
i > 1, assign to process $P_i$ a transaction as its $op_i$: it first performs a read of $O_1$
and, if the state returned is q, it then applies op to $O_i$.

Consider any execution where some group of processes each perform their
assigned operations. If the process $P_1$ on team A takes the first step, all of
the objects $O_2, \ldots, O_n$ will always remain in state q. However, if a process $P_i$
on team B takes the first step, it will change the state of $O_i$ to r, and $O_i$ will
remain in state r for the rest of the execution. Thus, one can determine whether a
process on team A took the first step from the state of the transactional object
by checking whether all of the objects $O_2, \ldots, O_n$ are in state q. The object
trans(T, 2) is therefore n-universal for any n. □
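The construction in this proof can be checked by brute force for small n. The sketch below is an illustration under stated assumptions, not part of the paper: the base type is taken to be a register with states q = 0 and r = 1, the update op is modelled as a write that returns "ack", and the function names are invented for the example. It enumerates every ordered sequence of distinct processes, builds the sets R_{A,j} and R_{B,j} of Definition 1, and tests that they are disjoint.

```python
from itertools import permutations

Q_STATE, R_STATE = 0, 1   # the states q and r; op takes an object from q to r


def apply_assigned_op(pid, state):
    """Apply the operation assigned to process pid (1-based) to the tuple of
    base-object states. Returns (response, new_state)."""
    s = list(state)
    if pid == 1:
        s[0] = R_STATE            # P_1 applies op to O_1
        return "ack", tuple(s)
    seen = s[0]                   # P_i (i > 1) reads O_1 ...
    if seen == Q_STATE:
        s[pid - 1] = R_STATE      # ... and applies op to O_i only if O_1 is in state q
    return seen, tuple(s)


def result_sets(n):
    """Map each j to the pair (R_Aj, R_Bj) of Definition 1, with team A = {P_1}."""
    sets = {j: (set(), set()) for j in range(1, n + 1)}
    for size in range(1, n + 1):
        for order in permutations(range(1, n + 1), size):
            state, responses = (Q_STATE,) * n, {}
            for pid in order:
                responses[pid], state = apply_assigned_op(pid, state)
            team = 0 if order[0] == 1 else 1     # team of the first process to step
            for j in order:
                sets[j][team].add((responses[j], state))
    return sets


def is_n_universal(n):
    return all(not (r_a & r_b) for r_a, r_b in result_sets(n).values())
```

Calling `is_n_universal(n)` for small n exercises every interleaving; the final state alone separates the two cases, since O_2, ..., O_n all stay in state q exactly when P_1 moves first.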

It follows from this theorem that a readable type T can be used to implement 
trans(T, 2) if and only if T is either trivial or universal. Furthermore, transac- 
tional objects can be much more powerful than ordinary multi-objects. For ex- 
ample, for any integer m greater than one, the multi-object multi(register, m), 
which has consensus number 2m − 2 [4], is incapable of implementing even
trans(register, 2). 

3 Consensus Base Objects 

This section considers transactional memory that uses base objects of the type n- 
consensus (defined in Sect. 1.2). The algorithm presented here shows that the 
ability to use transactions causes an exponential increase in the consensus num- 
ber of this type. This contrasts with the multi-object setting, where Afek, Merritt 
and Taubenfeld showed that if the base objects of type T may be accessed by at
most n processes, cons(multi(T, m)) is $\Theta(n\sqrt{m})$.

The algorithm uses a collection of tree data structures. Processes access these 
trees by starting at a leaf and working towards the root. Each node contains an n- 
consensus object. The nodes at one level of the tree act as a filter to control 
access to the nodes at the next higher level. The first value proposed to the first 
tree’s root becomes the output of all processes. However, since each base object 
in the transactional object can be accessed at most n times, information about 
the output value must be carefully distributed to all processes using a series of 
other trees that use similar filtering mechanisms to control access to the nodes. 
The algorithm given here will be adapted in Sections 4 and 5 to prove more 
general results. 

Proposition 4. The transactional object trans(n-consensus, m) can be used to
solve consensus among $n^{m-1}$ processes.

Proof. When n = 1, the result is trivial, so assume that n ≥ 2. A consensus
protocol for $N = n^{m-1}$ processes, $P_1, \ldots, P_N$, will be constructed using a
transactional object of type trans(n-consensus, m). Arrange the base objects into
trees as follows. The first tree, T, is a complete n-ary tree of height m − 2.
Each of the remaining trees, $T_j^k$ for $1 \le k \le n$ and $1 \le j < n^{m-2}$, consists
of a root with one child which is, itself, the root of a complete n-ary tree of
height m − 3. The tree T will be used to determine the output of the consensus
protocol. The remaining trees will be used to distribute this information to all
of the processes. A part of the data structure is shown in Fig. 1. The triangles
represent complete n-ary trees of height m − 3.

Divide the processes into groups of size n. Associate each group with
a different leaf of the tree T. Let $G_k$ be the set of processes that are
associated with leaf descendants of the kth child of T’s root. The trees $T_j^k$, for
$1 \le j < n^{m-2}$, will be used to distribute information to the processes in $G_k$.
Associate each of the n-process groups in $G_k$ with one of the leaves of $T_1^k$.
The group is also associated with the corresponding leaf in each of the trees $T_j^k$,
for $2 \le j < n^{m-2}$.

Fig. 1. Accesses to the transactional object by one process

Process Pi in group Gk begins a transaction in which it will access base 
objects of the tree T. The process will access several nodes in the tree, each time 
proposing its own process identifier, i, and receiving a response. The process is 
said to “win” at a node if the response it receives is the value it proposed, and it 
is said to “lose” at the node if it receives some other response. First, it accesses 
the leaf of T with which its group of processes has been associated. If it wins 
at the leaf, process Pi goes on to access the parent of the leaf, again proposing 
the value i. It continues in this way, traversing the path from the leaf to the 
root until it loses. If the process loses at some node below the root, it ends its 
transaction. If it does reach the root, $P_i$ proposes its own input value (instead
of i) to the object located there. Let v be the value that the process $P_i$ receives
as a response from the root. The process then proposes the value v to the root
of $T_1^k$, terminates its transaction, outputs v and halts. The access to the root
of $T_1^k$ has the effect of copying v into that object: any future access to that object
will return the value v.

So far, only the processes that accessed the root of T know the outcome.
However, this outcome has also been stored in the roots of the trees $T_1^k$. In an
attempt to retrieve this information, each active process in group $G_k$ performs
its second transaction on $T_1^k$. It executes the same algorithm on this tree as it
performed on T. (If the process does reach the root of $T_1^k$, it copies the output
value into the root of $T_2^k$.) The process continues in this way, performing
transactions on $T_2^k$, $T_3^k$, and so on, until it successfully reaches the root of a
tree. Whenever a process accesses the root of a tree $T_j^k$, it immediately copies
the value stored there to the root of the next tree, $T_{j+1}^k$, as part of the same
transaction, outputs the value and then halts. All outputs of this protocol will
be the input value of the first process to perform a transaction.
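The protocol described so far can be exercised with a small simulation. The following is a sketch under stated assumptions rather than the paper's own code: it takes the number of processes to be n^(m-1) (n processes at each leaf of a complete n-ary tree of height m - 2), treats each transaction as one atomic step chosen by a random scheduler, omits the copy when a group has no further tree, and uses invented names throughout. Assertions inside the code check the limits of n accesses per base object and m operations per transaction.

```python
import random


class ConsensusObject:
    """n-consensus base object: propose(v) returns the first value ever proposed."""

    def __init__(self, limit):
        self.value, self.accesses, self.limit = None, 0, limit

    def propose(self, v):
        self.accesses += 1
        assert self.accesses <= self.limit, "base object accessed more than n times"
        if self.value is None:
            self.value = v
        return self.value


def simulate(n, m, inputs, rng):
    """Schedule atomic transactions in random order.
    Returns (outputs by process, input of the first process scheduled)."""
    size = n ** (m - 2)                    # number of processes in each group G_k
    assert len(inputs) == n * size
    objects = {}

    def obj(key):
        if key not in objects:
            objects[key] = ConsensusObject(n)
        return objects[key]

    def transaction(i, j):
        """Process i works on tree T (j == 0) or on the j-th tree of its group."""
        assert j < size, "ran out of trees"
        k = i // size                      # index of the group containing process i
        tree = "T" if j == 0 else (k, j)
        idx = i if j == 0 else i % size
        ops = 0
        for level in range(m - 2):         # filter nodes, from the leaf upwards
            idx //= n
            ops += 1
            if obj((tree, level, idx)).propose(i) != i:
                return None                # lost below the root: transaction ends
        v = obj((tree, "root")).propose(inputs[i])
        ops += 1
        if j + 1 < size:                   # copy v into the root of the next tree
            obj(((k, j + 1), "root")).propose(v)
            ops += 1
        assert ops <= m
        return v

    next_tree = {i: 0 for i in range(len(inputs))}
    outputs, first = {}, None
    while next_tree:
        i = rng.choice(sorted(next_tree))
        if first is None:
            first = i
        v = transaction(i, next_tree[i])
        if v is None:
            next_tree[i] += 1
        else:
            outputs[i] = v
            del next_tree[i]
    return outputs, inputs[first]
```

With n = 2 and m = 4 this runs eight processes. Under every schedule tried, each process outputs the input of the first process scheduled, which is what the correctness argument predicts.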

Figure 1 shows the accesses to the transactional object by some process of
group $G_1$ in one execution. The solid circles represent nodes where the process
wins. The circles marked with an “X” represent the nodes where it loses. In this
example, the process performs five transactions. In the first, it attempts to gain
access to the root of T by working its way up from a leaf, but fails. The process
then tries unsuccessfully to gain access to the roots of $T_1^1$, $T_2^1$ and $T_3^1$ in its next
three transactions. In the final transaction, it successfully reaches the root of $T_4^1$
and then accesses the root of $T_5^1$ directly.

It must be shown that the algorithm terminates. Suppose the process $P_i$
attempts to access one of the trees $T_j^k$, but loses before reaching the root. Then
some other process must have performed its transaction on the tree before $P_i$.
The first such process will successfully reach the root and terminate. Thus, each
time process $P_i$ fails to access the root of a tree, some other process in $G_k$
terminates. There are $n^{m-2} - 1$ processes in $G_k$ that attempt to access the
information from the trees in the set $\{T_j^k : 1 \le j < n^{m-2}\}$, since one process
in $G_k$ gets the information directly from the root of T. Thus, process $P_i$ must
successfully access the root of one of the $n^{m-2} - 1$ trees $T_j^k$.

Since any path from a leaf to a root of one of the trees has m − 1 nodes,
no transaction contains more than m operations on the shared memory. It must
also be checked that no more than n processes access any base object, since
the behaviour of the base objects is well-defined only under this restriction. For
$1 \le j < n^{m-2}$, at most one process will access the root of $T_j^k$ during its (j + 1)th
transaction, since only one process will win at the child of the root. At most
one process will access the root of $T_1^k$ during its first transaction, since only one
of $G_k$’s processes can reach the root of T. For $1 < j < n^{m-2}$, at most one process
will access the root of $T_j^k$ during its jth transaction, since only one process can
reach the root of $T_{j-1}^k$ in its jth transaction. Thus, at most two processes access
the root of any tree $T_j^k$. The leaf of any tree is accessed only by the n processes
in the group associated with that leaf. Any other internal node is accessed only
by those processes that win at the children of the node, and there are at most n
such processes. □

4 Read-Modify-Write Base Types

Consider an object type whose set of possible states is Q. Any function $f : Q \to Q$
defines a RMW operation [10] that updates the state of the object by applying
the function f and then returns the previous state of the object. If all operations
on the object have this form, the object type is called a RMW type. This
class of types includes many important types of objects, such as compare&swap,
test&set and fetch&add objects.

To study the consensus power of transactional objects built from RMW types, 
it is helpful to use a special case of the consensus problem. The team($n_1$, $n_2$)
problem is defined to be a restricted version of the general consensus problem
among $n_1 + n_2$ processes where the processes are divided (in advance) into two
non-empty teams of sizes $n_1$ and $n_2$ and all processes on a team receive the same
input value. A tournament algorithm was used to prove the following lemma, 
which says that this restricted version of the consensus problem is just as hard 
as the general consensus problem. 

Lemma 5. [13] Suppose objects of type T and registers can be used to solve
the team($n_1$, $n_2$) problem for some positive integers $n_1$ and $n_2$. Then objects of
type T can be used, with registers, to solve consensus among $n_1 + n_2$ processes.

Herlihy showed that any non-trivial RMW object type has consensus number
at least two [4]. An algorithm similar to the one given in Proposition 4 for the 
case where n = 2 can be used to establish an exponential lower bound on the 
consensus number of transactional memory built from any non-trivial RMW 
object type. In fact, the lower bound applies to an even more general class of 
objects. It is applicable whenever two processes can each apply a single operation 
and immediately know which operation occurred first. (It will be shown below 
that any non-trivial RMW type has this property.) 

Theorem 6. Let T be any type. Suppose there exist two (not necessarily different)
operations $op_0$ and $op_1$, and a state q of type T so that $op_0$ and $op_1$ each
return different responses depending on the order that the two operations are
performed on an object initially in state q. Then, cons(trans(T, m)) ≥ $2^{m-1}$.

Proof. For i = 0, 1, let $R_i$ be the response returned by $op_i$ when an object of type
T is in state q, and $R'_i$ be the response returned by $op_i$ when it is preceded by
the other operation, $op_{1-i}$. The hypothesis requires that $R_0 \ne R'_0$ and $R_1 \ne R'_1$.

Consider a system with $2^{m-1}$ processes. Partition the set of processes into
two non-empty teams, A and B, each with $2^{m-2}$ processes. By Lemma 5, it
suffices to show that trans(T,m) can solve the team($2^{m-2}$, $2^{m-2}$) problem.

Arrange the base objects into trees and assign groups of processes to each leaf 
as in the proof of Proposition 4 for the case where n = 2. Processes on team A 
should be assigned to the left subtree of T, and processes on team B should be 
assigned to the right subtree of T. Initialize all base objects to the state q. 

The team consensus protocol will mimic the operation of the algorithm given 
in Proposition 4 with n = 2. However, instead of agreeing on an input value,
the algorithm will be used to agree on the team of the first process to access
the transactional object. Once this can be accomplished, it is easy to solve the
team($2^{m-2}$, $2^{m-2}$) problem: each process first writes its input value into a register
belonging to its team, and when the identity of the winning team is known,
the value stored in that team’s register is returned as the output.

To simulate the filtering action of a node below the root of a tree, one process
that accesses the node performs $op_0$ and the other process performs $op_1$. If the
process that performed $op_i$ receives the response $R_i$, then it has won at that
node. If it receives the response $R'_i$, then it has lost at that node. The two
processes that access the root of T determine which team accessed the root first 
in exactly the same way. 

To simulate the operation that stores a value into the root of $T_j^k$ (during
the jth transaction performed by a process), the process applies the operation
$op_0$ to the object if and only if it wants to record the fact that a process from
team B took the first step. If it wants to indicate that a process from team A
took the first step, it does nothing. To simulate the operation that reads the
contents of the root of $T_j^k$ (during the (j + 1)th transaction performed by a
process), the process applies the operation $op_1$ to the base object located at the
root. It interprets the response $R_1$ as indicating that a process from team A
went first, and the response $R'_1$ as indicating that a process from team B went
first.

Clearly, each transaction contains at most m operations on base objects. The 
correctness of the algorithm can be shown in exactly the same way as in the proof 
of Proposition 4. □ 

Corollary 7. For any non-trivial RMW type T, cons(trans(T, m)) ≥ $2^{m-1}$.

Proof. Since T is non-trivial, there exists some operation, op, that applies a
function f with $f(q) \ne q$ for some state q. Otherwise, one could trivially simulate
T without using shared memory by returning the initial state of the object as the
response to every operation. The operation op can be used as both operations
$op_0$ and $op_1$ of Theorem 6. If op is performed when the object is in state q,
it returns the response q. However, if op is performed when another op has
already been performed on an object in state q, the second op returns a different
response, f(q). The lower bound follows from Theorem 6. □

The test&set object defined in Sect. 1.2 is perhaps the most basic non-trivial 
RMW type. The following proposition demonstrates that this very simple type 
becomes even more powerful in the transactional setting than Corollary 7 sug- 
gests. The algorithm used in the proof has a different flavour from the algorithm 
in the previous proof. Once a base object has been accessed, further accesses 
never change its state, so it is not necessary to carefully control access to the 
base objects to avoid erasing information stored in them. 

Proposition 8. The transactional object trans(test&set, 3) is universal.

Proof. Let n be any integer greater than one. An algorithm will be given that
uses trans(test&set, 3) to solve the team(1, n − 1) problem. It follows from
Lemma 5 that the consensus number of trans(test&set, 3) is infinity. Here,
it is described how every process can determine which team first accesses the 
transactional object. Once this can be done, it is easy to solve team consensus 
using two additional registers, as in the preceding proof. 

Divide the n processes into two teams $A = \{P_1\}$ and $B = \{P_2, \ldots, P_n\}$. The
algorithm uses 2n + 1 base objects, labelled $A_1, \ldots, A_n$, $B_1, \ldots, B_n$ and C, all of
which are initially in state 0.

The object C is used to determine which team wins: every process will even- 
tually discover which team accessed C first. The other objects are used to dis- 
tribute the information about the winning team to all of the processes. First, the 
method of distributing this information will be described informally. Processes 
access the objects $A_1, \ldots, A_n$ in order so that $A_{i+1}$ is never set to 1 while $A_i$
still has value 0. The same comment applies to the objects $B_1, \ldots, B_n$. Consider
some moment in the computation. Let a be the largest index such that $A_a$ has
been set to 1. Let b be the largest index such that $B_b$ has been set to 1. The
information about the first team to access object C is stored in the other base
objects by maintaining the following invariant after each complete transaction.

Invariant: If a process from team A accessed the transactional object first,
then a = b + 1. If a process from team B accessed the transactional object first,
then b = a + 1.

Each process retrieves information about which team first accessed the base
object C by accessing pairs $(A_i, B_i)$ for increasing values of i until it finds a pair
where only one of the two base objects is set to 1.

The algorithm will now be described in detail. Process $P_1$ performs a single
transaction. It first performs a test&set operation on object C. If it receives the
response 1, $P_1$ knows that some other process has already accessed C, so it can
conclude that a process on team B accessed the transactional object first, and
it need not perform any further actions. On the other hand, if it receives the
response 0, then it knows it is the first process to access the transactional object.
It then performs a test&set operation on object $A_1$ to ensure that the invariant
holds and performs no further actions.

The algorithm for a process on team B is more complicated. In the first
transaction, the process performs a test&set operation on the base object C. If the
result is 0, it knows that it is the first process to access the transactional object.
It then performs the operation test&set on $B_1$ as part of the same transaction in
order to satisfy the invariant. If the process receives the result 1 from C, it knows
that some other process has already accessed the transactional object. The
process must then use the other base objects to determine which team made the
first access. The process performs a number of transactions, accessing $A_1$ and $B_1$
in the first transaction, then $A_2$ and $B_2$ in the second transaction, and so on,
until it gets different responses from two objects $A_i$ and $B_i$. Suppose that $A_i$
returns 1 and $B_i$ returns 0, indicating that a process from team A accessed the
transactional object first. To ensure that the invariant remains true, the
process performs a test&set operation on $A_{i+1}$ as part of the same transaction. It
can then halt, knowing that team A performed the first transaction. The case
where $A_i$ returns 0 and $B_i$ returns 1 is symmetric.

It must be checked that this algorithm does terminate. The first process to 
perform a transaction that accesses the pair of objects Ai and Bi will halt at 
the completion of that transaction. Thus, at most n − i processes on team B
will perform the transaction that accesses Ai and Bi. Thus, every process will 
perform at most n transactions. It is easy to see that the invariant is true after 
each complete transaction, and the correctness of the algorithm follows. □ 
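The algorithm of this proof can be simulated directly. The sketch below makes several assumptions that are not in the paper: transactions are executed atomically by a random scheduler, the access to C by a team-B process that receives 1 is treated as a transaction of its own, and all names are invented for the example. Each transaction touches at most three test&set objects.

```python
import random


class TestAndSet:
    def __init__(self):
        self.state = 0

    def test_and_set(self):
        old, self.state = self.state, 1
        return old


def simulate(n, rng):
    """Team A = {1}, team B = {2, ..., n}.
    Returns (decisions by process, team of the first process to access C)."""
    C = TestAndSet()
    A = {i: TestAndSet() for i in range(1, n + 1)}
    B = {i: TestAndSet() for i in range(1, n + 1)}
    stage = {p: 0 for p in range(1, n + 1)}   # 0: next transaction accesses C;
                                              # i >= 1: next one accesses pair i

    def team(p):
        return "A" if p == 1 else "B"

    def transaction(p):
        i = stage[p]
        if i == 0:
            if C.test_and_set() == 0:             # p is the first process overall
                (A if p == 1 else B)[1].test_and_set()
                return team(p)
            if p == 1:                            # some process on team B came first
                return "B"
            stage[p] = 1
            return None
        a, b = A[i].test_and_set(), B[i].test_and_set()
        if a == b:
            assert a == 1, "invariant violated: neither object of the pair was set"
            stage[p] = i + 1
            return None
        (A if a == 1 else B)[i + 1].test_and_set()    # restore the invariant
        return "A" if a == 1 else "B"

    decisions, first_team = {}, None
    while len(decisions) < n:
        p = rng.choice([q for q in stage if q not in decisions])
        if first_team is None:
            first_team = team(p)
        d = transaction(p)
        if d is not None:
            decisions[p] = d
    return decisions, first_team
```

In every run, all processes decide on the team of the process that was scheduled first, and no index beyond n is ever needed for the A and B objects, in line with the bound of n − i processes accessing pair i.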

The algorithm given in this proof can easily be adapted to work for any base 
type that is equipped with an operation whose first invocation returns a response 
different from any subsequent invocation’s response. 

5 A General Bound 

Theorem 6 gives a lower bound on the consensus numbers of transactional objects 
that applies to a very large class of base types. In this section it will be shown 
that a property which is weaker than the hypothesis of Theorem 6 and is satisfied 
by any non-trivial base type can be used to prove a weaker, but still exponential, 
lower bound on consensus numbers of transactional objects. The hypothesis of 
Theorem 6 assumed that two processes can each perform a single operation so 
that both processes can immediately tell which of the two performed its operation 
first. The following theorem shows that a similar algorithm will solve consensus 
even if only one of the two processes can determine which process accessed the 
base object first. Because of the weaker assumption, processes will have to access 
two base objects at each node, and this has the effect of halving the tree heights 
and producing a corresponding reduction in the number of processes that can 
be assigned to the leaves. 

Theorem 9. For any non-trivial base type T, cons(trans(T, m)) ≥ $2^{\lfloor (m-1)/2 \rfloor}$.

Proof. First, the non-triviality of T will be used to prove that T has a useful 
property, similar to the hypothesis of Theorem 6. Since T is non-trivial, the 
response to some operation, $op_0$, cannot depend entirely on the state in which
the object is initialized. Thus, there is some (shortest) sequence of operations
that takes the object from a state where it would return the response R to $op_0$
into a state where it would return a different response R' to $op_0$. Let $op_1$ be the
last operation in this sequence, and let q be the state of the object just before $op_1$
is applied. If an object begins in state q and two processes apply the operations
$op_0$ and $op_1$, the process applying $op_0$ can tell whether the other process has
already taken its step, by checking whether it gets the response R or R'.

Now, consider a system with $2^{\lfloor (m-1)/2 \rfloor}$ processes, partitioned into two
teams A and B of equal size. As in the proof of Theorem 6, it suffices to give an
algorithm that allows every process to determine whether a process from team A 
or from team B accesses the transactional object first. 

The base objects are organized into trees as in the proof of Proposition 4 for
the case when n = 2, except that the heights of the trees are reduced to $\lfloor (m-3)/2 \rfloor$,
and nodes of the trees have two base objects instead of one. (However, the roots
of the trees need only one base object.) The two objects at a node will
be referred to as the left and right objects of that node. All base objects are
initialized to the state q. Associate groups of two processes with the leaves of
the trees as in Proposition 4.

The algorithm to determine the team containing the process that performed 
the first transaction will mimic the algorithm of Theorem 6. To simulate the 
filtering action of the nodes below the roots of trees, one process performs $op_0$
on the left object and $op_1$ on the right object, and the other process performs $op_1$
on the left object and $op_0$ on the right object. If a process receives the response R
from $op_0$, then the process wins at the node, and if the response is R', then the
process loses at the node. The winning team at the root of T is determined in 
the same way. 

To simulate an operation that writes the winning team to the root of $T_j^k$, a
process performs $op_1$ on the object at that node if and only if it wants to indicate
that a process from team B took the first step. Otherwise it does nothing. To
simulate the operation that reads the contents of the root of $T_j^k$, the process
performs $op_0$ on the object in the node. The response R indicates that team A
took the first step and the response R' indicates that team B took the first step. 

In the worst case, a transaction accesses all base objects on some path from 
a leaf to a root of a tree plus one base object in another tree. Each transaction 
therefore contains at most $2(\lfloor (m-3)/2 \rfloor + 1) + 1 \le m$ memory accesses, as
required. The proof of correctness is the same as in Proposition 4. □ 
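The two-object filter used at each node can be illustrated with a concrete base type. In the sketch below, which is an assumption-laden example rather than the paper's construction in full generality, the base type is a register initially in state q = 0, op_0 is a read (so R = 0 and R' = 1) and op_1 writes 1. The process called X applies op_0 to the left object and op_1 to the right one; the process called Y does the opposite. The names `Register` and `filter_node` are invented here.

```python
class Register:
    def __init__(self):
        self.state = 0        # the state q

    def read(self):           # plays the role of op_0
        return self.state

    def write_one(self):      # plays the role of op_1
        self.state = 1


def filter_node(order):
    """Run the node accesses of the listed processes ("X", "Y") atomically, in order.
    Returns a dict mapping each process to True if it wins at the node."""
    left, right = Register(), Register()

    def access_x():
        response = left.read()        # op_0 on the left object
        right.write_one()             # op_1 on the right object
        return response == 0          # response R means X wins

    def access_y():
        left.write_one()              # op_1 on the left object
        response = right.read()       # op_0 on the right object
        return response == 0          # response R means Y wins

    results = {}
    for name in order:
        results[name] = access_x() if name == "X" else access_y()
    return results
```

Whichever process accesses the node first wins and the other loses, even though only the op_0 response carries information. This is why each filter node needs two base objects under the weaker hypothesis.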

Corollary 10. If T is neither trivial nor universal, there is some m such that T
cannot provide a wait-free implementation of trans(T, m).

Proof. If $m = 2\log_2(\mathrm{cons}(T)) + 3$, then cons(trans(T, m)) > cons(T), so
trans(T, m) is strictly more powerful than T. □
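The arithmetic behind this choice of m can be spelled out. The derivation below is a sketch under two assumptions that are not stated explicitly in the text: the bound of Theorem 9 is taken to be $2^{\lfloor (m-1)/2 \rfloor}$ (the number of processes served by trees of height $\lfloor (m-3)/2 \rfloor$ with two processes at each leaf), and m is rounded up to an integer. Write c for cons(T), which is finite because T is not universal.

```latex
\begin{align*}
m = \lceil 2\log_2 c \rceil + 3
  &\;\Longrightarrow\; \frac{m-1}{2} \ge \log_2 c + 1, \\
\left\lfloor \frac{m-1}{2} \right\rfloor
  &> \frac{m-1}{2} - 1 \ge \log_2 c, \\
\mathrm{cons}(\mathrm{trans}(T,m))
  &\ge 2^{\lfloor (m-1)/2 \rfloor} > 2^{\log_2 c} = c = \mathrm{cons}(T).
\end{align*}
```

Since a wait-free implementation cannot raise the consensus number, objects of type T cannot implement trans(T, m) for this m.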

6 Open Questions 

It has been shown that the ability to perform even short transactions can greatly 
increase the consensus numbers of object types. Can corresponding upper bounds 
on consensus numbers of transactional objects be proved? 

In the multi-object model, memory accesses within an atomic operation must 
be performed on different base objects. In the transactional model, processes are 
free to access base objects repeatedly during an atomic action, but this ability 
was not used in the algorithms given here. Is there any case where the ability to 
access the same base object repeatedly in a transaction really does help? 

Another possible setting for transactional objects would be one where each 
transaction must perform all of the shared-memory accesses on a single base 
object. How does such a restriction affect the consensus power of objects? In 
the case of test&set base objects, transactions on a single base object give no 
additional power, since any such transaction has the same effect as doing a single 
test&set operation. Therefore, this single-object version of transactional memory
is strictly weaker than general transactional objects.

Abstract. The cost of using message-passing to implement linearizable
read/write objects for shared memory multiprocessors with drifting clocks 
is studied. We take as cost measures the response times for perform- 
ing read and write operations in distributed implementations of virtual 
shared memory consisting of such objects. A collection of necessary conditions
on these response times is presented for a large family of as-
sumptions on the network delays. The assumptions include the common 
one of lower and upper bounds on delays, and bounds on the difference 
between delays in opposite directions. In addition, we consider broadcast 
networks, where each message sent from one node arrives at all other 
nodes at approximately the same time. 

The necessary conditions are stated in the form of “gaps” on the values 
that the response times may attain in any arbitrary execution of the 
system; the ends of the gap intervals depend solely on the delays in a 
particular execution, and on certain fixed parameters of the system that 
express each specific delay assumption. The proofs of these necessary
conditions are comprehensive and modular; they consist of two major 
components. The first component is independent of any particular type 
of delay assumptions; it constructs a “counter-example” execution, which 
respects the delay assumptions only if it is not linearizable. The second 
component must be tailored for each specific delay assumption; it derives 
necessary conditions for any linearizable implementation by requiring 
that the “counter-example” execution does not respect the specific delay 
assumptions. 

Our results highlight inherent limitations on the best possible cost for 
each specific execution of a linearizable implementation. Moreover, our 
results imply lower bounds on the worst possible such costs as well; interestingly,
for the last two assumptions on message delays, these worst-case
lower bounds are products of the drifting factor of the clocks and the 
delay uncertainty inherent for the specific assumption. 



1 Introduction 

Shared memory has become a convenient paradigm of interprocessor communi- 
cation in contemporary computer systems. Perhaps this is so due to its combined 
features that, first, it facilitates a natural extension of sequential programming,
and, second, it is more high-level than message-passing in terms of semantics. 
This convenience has favored the evolution of concurrent programming on top 
of shared memory for the solution of many diverse problems. Thus, supporting 
shared memory in distributed memory machines has become a major current
objective.

Unfortunately, implementing shared memory in a distributed memory machine
encounters many complications; these complications are due to the high
degree of parallelism and the lack of synchronization between dispersed processors,
which are both inherent in a distributed architecture. This necessitates
the explicit and precise definition of the guarantees provided by shared memory
implemented this way; such a definition is called a consistency condition.
Linearizability is a basic consistency condition for concurrent objects of shared memory
due to Herlihy and Wing [7]. Informally, linearizability requires that each operation,
spanning an interval of time from its invocation to its response,
appears to take effect at some instant in this interval. The use of linearizable
data abstractions simplifies both the specification and the proofs of multiple
instruction/multiple data shared memory algorithms, and enhances compositionality.

In this work, we continue the study of the impact of timing assumptions 
on the cost of supporting linearizability in distributed systems; this study has 
been initiated by Attiya and Welch [2], and continued further by Mavronicolas 
and Roth [12], Chaudhuri et al. [3], Friedman [5], and Kosa [9]. We consider
a distributed system that introduces non-negligible timing uncertainty in two 
significant ways: first, in the synchronization with respect to real time of each 
individual process, and, second, in the communication among different processes. 

Following previous work [2,3,5,9,12], we consider a model consisting of a 
collection of application programs running concurrently and communicating 
through virtual shared memory, which consists of a collection of read/write objects.
These programs are running in a distributed system consisting of a collection
of processes located at the nodes of a communication network. The shared
memory abstraction is implemented by a memory consistency system (MCS),
which uses local memory at each process node. Each MCS process executes a 
protocol, which defines the actions it takes on operation requests by the appli- 
cation programs. Specifically, each application program may submit requests to 
access shared data to a corresponding MCS process; the MCS process responds 
to such a request, based, possibly, on information from messages it receives from 
other MCS processes. In doing so, the MCS must, throughout the network, pro- 
vide linearizability with respect to the values returned to application programs. 

We take as cost measures the response times for performing read and write 
operations on read/write objects in such a distributed system. However, a first 
major departure from previous work [2,3,5,9,12] addressing these particular cost
measures is that we show bounds on them that hold for each specific execution of 
the system, while bounds established in previous work on the same cost measures 
hold only for the worst execution. Recent research work in distributed computing 
theory has addressed bounds that hold for each specific execution in the context
of the clock synchronization [1,13] and connection management [8,11] problems. 
A common argument in support of showing such “per-execution” bounds is that 
for certain kinds of assumptions on network delays, the costs for the worst-case 
execution may, in fact, have to be unbounded [1], while one may still want to 
reward algorithms that achieve costs that are the best possible for each specific
instance [1]. 

A second major departure from previous related work [2,3,5,9,12] is with respect
to assumptions on message delays; all that work has considered the relatively
simple case where there are lower and upper bounds on message delays.
Under this assumption, linearizable implementations of shared memory objects 
have been designed [3,5,12], whose efficiency depends critically on the existence 
of tight lower and upper bounds on message delays. This assumption, however, 
may not always apply, since it is often the case that there do not exist tight lower 
and upper bounds on message delays, while there is some other relevant infor- 
mation about the delays. We draw excellent motivation from the work of Attiya 
et al. [1] on clock synchronization under different delay assumptions to study 
the problem of implementing linearizable read/write objects in message-passing 
under the following assumptions on message delays (considered in [1]): (1) There
is a lower and an upper bound on delays, d − u and d, respectively. (2) There is a
bound ε on the difference between delays in opposite directions; this assumption
is supported by experimental results revealing that message delays in opposite
directions of a bidirectional link usually come very close (cf. [1]). The clock
synchronization problem has already been studied under this assumption [1]. (3)
There is a bound β on the difference between the times when different processes
receive a broadcast message; this assumption is useful for broadcast networks 
that are used in many local area networks. The clock synchronization problem 
has been studied under this assumption in [1,6,14]. 

A third major departure from previous related work is with respect to the
amount of synchronization of processes to real time. While that work [2,3,5,9,12] 
has assumed “perfect” (non-drifting, but possibly translated) clocks to be avail- 
able to processes, we allow a small “drift” on the processes’ clocks; the impact 
of this assumption on the time complexity of distributed algorithms has already 
been studied for the clock synchronization problem (see, e.g., [13]), and the con- 
nection management problem [8,11]. 

The main contribution of our work is a systematic methodology for prov- 
ing necessary conditions on the response times of read and write operations, 
that hold for each specific execution of any linearizable implementation, under 
a variety of message delay assumptions, and allowing a small “drift” on the 
processes’ clocks. This methodology yields a collection of corresponding necessary
conditions. Our proof methodology is modular, and consists of two major
components. The first component is independent of the specific type of delay 
assumptions, while the second one addresses each such type in a special way. 

In more detail, the first component starts with a linearizable execution that 
is chosen in a different way for the write operation, the read operation, and 
their combination, respectively. In each of the three cases, we use the technique
of “retiming,” originally introduced by Lundelius and Lynch for showing lower 
bounds for the clock synchronization problem [10], to transform this execution 
into another possible execution of the system that is not linearizable. The trans- 
formation maintains the view held by each process in the original execution to 
the result of the transformation; moreover, the clocks in the latter are still drift- 
ing. Roughly speaking, retiming is used to change the timing and the ordering 
of events in an execution of the system, while precluding any particular process 
from “realizing” the change. 

The second component is tailored for each specific assumption on message 
delays. More specifically, the starting point of the second component is the result 
of transforming the original execution, and the corresponding message delays in 
this result. For each specific assumption on message delays, we insist that the 
resulting delays conform to the assumption. This yields corresponding upper and
lower bounds on the response time of the read and write operations, as a function 
of the message delays in the original linearizable execution. 

Our lower and upper bounds highlight inherent limitations on the best possi- 
ble cost for each specific execution of a linearizable implementation, as a function 
of the message delays in the execution, and the parameters associated with each 
specific assumption on message delays. Moreover, our results also imply Ω(ρ²ε)
and Ω(ρ²β) worst-case lower bounds on response times for both write and read
operations for the bias model and the model of broadcast networks, respectively.
These lower bounds indicate that the timing uncertainty in the drifting clocks
model must multiply the delay uncertainty (ε and β, respectively) for each of
these models. We have not been able to deduce a corresponding fact for the
model with lower and upper bounds on delays. (However, for the special case
where ρ = 1, our general results imply worst-case results that are identical to
those in [2,12].) This model appears to be stronger than the previous two since
it does not allow unbounded delays; we conjecture that linearizable implementations
allowing for response times o(ρ²u) for both write and read operations are
possible for this model.

2 Framework 

For the system model, we follow [2,12]. We consider a collection of application
programs running concurrently and communicating through virtual shared memory,
consisting of a collection 𝒳 of read/write objects, or objects for short. Each
object X ∈ 𝒳 attains values from a domain, a set V of values. We assume a
system consisting of a collection of nodes, connected via a communication network.
The shared memory abstraction is implemented by a memory consistency
system (MCS), consisting of a collection of MCS processes, one at each node,
that use local memory, execute some local protocol, and communicate through
sending messages along the network. Each MCS process pi, located at node i, is
associated with an application program Pi; pi and Pi interact by using call and
response events.
Call events at pi represent initiation of operations by the application program
Pi; they are Readi(X) and Writei(X, v), for all objects X ∈ 𝒳 and values
v ∈ V. Response events represent responses by pi to operations initiated by
the application program Pi; they are Returni(X, v) and Acki(X), for all objects
X ∈ 𝒳 and values v ∈ V. Message-delivery events represent delivery of a message
from any other MCS process to pi. Message-send events represent sending
of a message by pi to any other MCS process.

For each i, there is a physical, real-time clock at node i, readable by MCS
process pi but not under its control, that may drift away from the rate of real
time. Formally, a clock is a strictly increasing (hence, unbounded), piecewise
continuous function of real time γ : ℝ → ℝ. Denote by γ⁻¹ the inverse of γ. Fix
any constant ρ > 1, called drift. A ρ-drifting clock, or drifting clock for short,
is a clock γ : ℝ → ℝ such that for all real times t1, t2 ∈ ℝ with t1 < t2,
1/ρ ≤ (γ(t2) − γ(t1))/(t2 − t1) ≤ ρ. Define ρ² to be the drifting factor of a
ρ-drifting clock. The clocks cannot be modified by the processes. Processes do not
have access to real time; instead, each process obtains information about time
from its clock. The call, message-delivery and timer-expire events are called
interrupt events. The response, message-send and timer-set events are called
react events.
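
The drift condition can be checked numerically on sampled times. The following is a minimal sketch, not part of the original development; the function name is_rho_drifting, the sampling approach, and the tolerance tol are assumptions made for illustration only.

```python
from itertools import combinations


def is_rho_drifting(clock, rho, sample_times, tol=1e-12):
    """Check 1/rho <= (clock(t2) - clock(t1)) / (t2 - t1) <= rho
    for every pair t1 < t2 taken from sample_times.

    This only tests the sampled pairs; it is a sanity check, not a proof.
    """
    times = sorted(set(sample_times))
    for t1, t2 in combinations(times, 2):
        rate = (clock(t2) - clock(t1)) / (t2 - t1)
        if rate < 1.0 / rho - tol or rate > rho + tol:
            return False
    return True


# A linear clock running at rate sigma is rho-drifting iff 1/rho <= sigma <= rho.
samples = [0.0, 0.5, 1.0, 2.0, 7.5]
print(is_rho_drifting(lambda t: 1.05 * t, rho=1.1, sample_times=samples))
print(is_rho_drifting(lambda t: 1.2 * t, rho=1.1, sample_times=samples))
```

The first clock runs within the allowed rate interval [1/ρ, ρ]; the second runs too fast for ρ = 1.1 and is rejected.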

Each MCS process pi is modeled as an automaton with a (possibly infinite)
set of states, including an initial state, and a transition function. Each interrupt
event at MCS process pi causes an application of its transition function, resulting
in a computation step. The transition function is a function from tuples of a
state, a clock time and an interrupt event to tuples of a state and sets of react
events. Thus, the transition function takes as input the current state, the local
clock time, and an interrupt event, and returns a new state, a set of response
events for the corresponding application program, a set of messages to be sent
to other MCS processes, and a set of timer-set events. A history for an MCS
process pi with clock γi is a mapping hi from ℝ (real time) to finite sequences
of computation steps by pi such that: (1) For each real time t, there is only a
finite number of times t' < t such that the corresponding sequence of steps hi(t')
is non-empty; thus, the concatenation of all such sequences in real-time order
is also a sequence, called the history sequence. (2) The old state in the first
computation step in the history sequence is pi's initial state. (3) The old state of
each subsequent computation step is the new state of the previous computation
step in the history sequence. (4) For each real time t, the clock time component
of every computation step in the sequence hi(t) is equal to γi(t). (5) For each
real time t, there is at most one computation step whose interrupt event is a
timer-set event; this step is ordered last in the sequence hi(t). (6) At most one
call event is “pending” at a time; this outlaws pipelining or prefetching at the
interface between pi and Pi. (8) For each call event, there exists a matching
response event in some subsequent computation step of the history sequence.

Each pair of matching call and response events forms an operation. The call 
event marks the start of the operation, while the response event marks its end. 
An operation op is invoked when the application program issues the appropriate 
call event for op; op terminates when the MCS process issues the appropriate
response for op. For a given MCS, an execution σ is a set of histories, one for
each MCS process, such that for any pair of MCS processes pi and pj, i ≠ j,
there is a one-to-one correspondence between the messages sent by pi to pj and
those delivered at pj that were sent by pi. Use this message correspondence to
define the delay of any message in an execution to be the real time of delivery
minus the real time of sending. By definition of execution, a zero lower bound
and an infinite upper bound hold on delay. Define Δij^(e) to be the set of delays of
messages from MCS process pi to MCS process pj in execution e. Two executions
are equivalent [10] if each process has the same history sequence and associated
local clock times in both. Intuitively, equivalent executions are indistinguishable 
to the processes, and only an “outside observer” with access to real time can tell 
them apart. 

We continue with specific assumptions on the delays, borrowing from [1,13]. 
Each assumption gives rise to a particular delay model with an associated set 
of admissible executions. The assumption of lower and upper bounds on the 
delays [2,5,12] places a lower and an upper bound on the delay for any message 
exchanged between any pair of processes. Fix some known parameters u and d,
0 ≤ u ≤ d ≤ ∞; u is the delay uncertainty, while d is the maximum delay.
Execution σ is admissible if for each pair of MCS processes pi and pj, for every
message m in σ from pi to pj, dσ(m) ∈ [d − u, d].

The assumption of bounds on the round trip delay bias [1, Section 5.2] requires
that the difference between the delays of any pair of messages in opposite
directions be bounded. Fix any constant ε > 0, called the delay uncertainty. Formally,
an execution σ is admissible if for any pair of processes pi and pj, and for
any pair of messages m and m' received by pi from pj and received by pj from pi,
respectively, |dσ(m) − dσ(m')| ≤ ε.
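
To contrast the first two delay models, the admissibility conditions can be written as simple predicates over observed delays. This is a minimal sketch under our own assumptions: the function names and the representation of delays as plain lists of numbers are hypothetical, and non-strict inequalities are assumed.

```python
def admissible_bounded(delays, d, u):
    """Lower/upper bound model: every delay must lie in [d - u, d]."""
    return all(d - u <= x <= d for x in delays)


def admissible_bias(delays_ij, delays_ji, eps):
    """Bias model: any two messages in opposite directions between the
    same pair of processes have delays differing by at most eps."""
    return all(abs(a - b) <= eps for a in delays_ij for b in delays_ji)


# Long but nearly symmetric delays: fine under the bias model,
# not under bounds [8, 10].
forward = [100.0, 100.4]   # delays of messages from pi to pj
backward = [100.2]         # delays of messages from pj to pi
print(admissible_bias(forward, backward, eps=0.5))
print(admissible_bounded(forward + backward, d=10.0, u=2.0))
```

The example illustrates why the bias model admits executions with arbitrarily large delays, whereas the model with lower and upper bounds does not.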

The assumption of multicast networks has been studied in [1,4,6,14] in the 
context of the clock synchronization problem; our presentation follows [1, Section
5.3]. To define this assumption, we replace message-send events by events of
the form Broadcasti(m) at the MCS process pi, for all messages m; such events
represent a broadcast of m to all MCS processes. The definition of an execution
is modified so that for any pair of processes pi and pj, i ≠ j, there is a one-to-one
correspondence between the messages broadcast by pi, and those delivered
at pj and broadcast by pi. Use this message correspondence to define the delay
of message m to process pj in execution σ, denoted dσ(m, pj), to be the real time
of delivery at pj in σ minus the real time of broadcast by pi in σ. Fix any
constant β > 0, called the broadcast accuracy. Execution σ is admissible if for
any process pi, for any message m broadcast by pi, |dσ(m, pj) − dσ(m, pk)| ≤ β;
that is, m reaches pj at most β time units later than it reaches pk, and vice versa.
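
The broadcast condition compares, for each broadcast message, the delays observed at different receivers. A minimal sketch follows; the function name and the nested-dictionary representation (message id to receiver to delay) are assumptions made for illustration.

```python
def admissible_broadcast(broadcast_delays, beta):
    """Broadcast model: for each broadcast message, the delays to any two
    receivers differ by at most beta.

    broadcast_delays maps a message id to a dict {receiver: delay}.
    """
    for per_receiver in broadcast_delays.values():
        delays = list(per_receiver.values())
        # The spread max - min bounds every pairwise difference.
        if delays and max(delays) - min(delays) > beta:
            return False
    return True


tight = {"m1": {"pj": 5.0, "pk": 5.25}}
loose = {"m1": {"pj": 5.0, "pk": 6.0}}
print(admissible_broadcast(tight, beta=0.5))
print(admissible_broadcast(loose, beta=0.5))
```

Checking the spread between the largest and smallest delay is equivalent to checking all pairs of receivers, which keeps the predicate linear in the number of receivers.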

Each object X has a serial specification [7], which describes its behavior in 
the absence of concurrency and failures. Formally, it defines: (1) A set Op(X)
of operations on X, which are ordered pairs of call and response events. Each
operation op ∈ Op(X) has a value val(op) associated with it. (2) A set of legal
operation sequences for X, which are the allowable sequences of operations on X.
For each process pi, Op(X) contains a read operation [Readi(X), Returni(X, v)]
on X and a write operation [Writei(X, v), Acki(X)] on X, for all values v ∈ V; v
is the value associated with each of these operations. The set of legal operation
sequences for X contains all sequences of operations on X for which, for any read
operation rop in the sequence, either val(rop) = ⊥ and there is no preceding
write operation in the sequence, or val(wop) = val(rop) for the latest preceding
write operation wop. A sequence of operations τ for a collection of processes
and objects is legal if, for every object X ∈ 𝒳, the restriction of τ to operations
on X, denoted τ | X, is in the set of legal operation sequences for X.
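
The legality condition for a single read/write object can be sketched as a one-pass check. The encoding below is an assumption made for illustration: operations are (kind, value) pairs, and the initial value ⊥ is represented by Python's None.

```python
BOTTOM = None  # stands for the initial value, written ⊥ in the text


def is_legal(sequence):
    """Return True if every read returns the value of the latest preceding
    write, or BOTTOM when no write precedes it.

    sequence is a list of ("read" | "write", value) pairs on one object.
    """
    current = BOTTOM
    for kind, value in sequence:
        if kind == "write":
            current = value
        elif kind == "read":
            if value != current:
                return False
        else:
            raise ValueError(f"unknown operation kind: {kind!r}")
    return True


print(is_legal([("read", BOTTOM), ("write", 1), ("read", 1)]))
print(is_legal([("write", 1), ("write", 2), ("read", 1)]))
```

The second sequence is illegal because the read returns the value of an overwritten write.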

Given an execution σ, let ops(σ) be the sequence of call and response events
appearing in σ in real-time order, breaking ties for each real time t as follows:
First, order all response events whose matching call events occur before time t,
using process identification numbers (ids) to break any remaining ties. Then,
order all operations whose call and response events both occur at time t. Preserve
the relative ordering of operations for each process, and break any remaining ties
using again process ids. Finally, order all call events whose matching response
events occur after time t, using process ids to break any remaining ties. An
execution σ specifies a partial order →σ on the operations appearing in σ: for
any operations op1 and op2 appearing in σ, op1 →σ op2 if the response for op1
precedes the call for op2 in ops(σ); that is, op1 →σ op2 if op1 completely precedes
op2 in ops(σ). Given an execution σ, an operation sequence τ is a serialization
of σ if it is a permutation of ops(σ). A serialization τ of σ is a linearization of σ
if it extends →σ; that is, if op1 →σ op2, then op1 →τ op2. Let τ be a sequence
of operations. Denote by τ | i the restriction of τ to operations at process pi;
similarly, denote by τ | X the restriction of τ to operations on the object X. For
an execution σ, these definitions can be extended in the natural way to yield
ops(σ) | i and ops(σ) | X. An execution σ is linearizable [7] if there exists a legal
linearization τ of σ such that for each MCS process pi, ops(σ) | i = τ | i. An
MCS is a linearizable implementation of 𝒳 if every admissible execution of the
MCS is linearizable.
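
For very small executions on a single object, linearizability can be tested by brute force over all permutations. The sketch below is illustrative only and makes simplifying assumptions that are ours: all call and response times are distinct (so the tie-breaking rules are not needed), each process has at most one pending operation, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Op:
    proc: str      # issuing process
    kind: str      # "read" or "write"
    value: object  # value written or returned (None stands for ⊥)
    call: float    # real time of the call event
    resp: float    # real time of the response event


def precedes(a, b):
    """a completely precedes b in real time."""
    return a.resp < b.call


def is_legal(seq):
    current = None
    for op in seq:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False
    return True


def is_linearizable(ops):
    """Search for a legal permutation that extends the real-time order."""
    for perm in permutations(ops):
        respects_order = all(
            not precedes(perm[j], perm[i])
            for i in range(len(perm))
            for j in range(i + 1, len(perm))
        )
        if respects_order and is_legal(perm):
            return True
    return False


# pi writes 1, then pj writes 2, then pk reads 2: linearizable.
e = [Op("pi", "write", 1, 0.0, 1.0), Op("pj", "write", 2, 1.5, 2.5),
     Op("pk", "read", 2, 3.0, 4.0)]
# Same local views, but the two writes are swapped in real time.
e_prime = [Op("pj", "write", 2, 0.0, 1.0), Op("pi", "write", 1, 1.5, 2.5),
           Op("pk", "read", 2, 3.0, 4.0)]
print(is_linearizable(e), is_linearizable(e_prime))
```

The search is factorial in the number of operations, so it is only meant for toy executions such as the three-operation scenario used in Section 3.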

The efficiency of an implementation A of 𝒳 is measured by the response
time for any operation on an object X ∈ 𝒳. Given a particular MCS A and a
read/write object X implemented by it, the time |op_A(X, σ)| taken by an operation
op on X in an admissible execution σ of A is the maximum difference
between the times at which the response and call events of op occur in σ, where
the maximum is taken over all occurrences of op in σ. In particular, we denote
by |R_A(X, σ)| and |W_A(X, σ)| the maximum time taken by a read and a write
operation, respectively, on X in σ, where the maximum is taken over all occurrences
of the corresponding operations in σ. Define |R_A(X)| (resp., |W_A(X)|)
to be the maximum of |R_A(X, σ)| (resp., |W_A(X, σ)|) over all executions σ of A.

Fix e to be any execution, and let op = [Call(op), Response(op)] be any 
operation in e. We denote by t_c^(e)(op) and t_r^(e)(op) the (real) times at which
Call(op) and Response(op), respectively, occur in e. We use val^(e)(op) to denote
the value associated with the “execution” of operation op in e.
3 Writes

A construction of a non-linearizable, if admissible, execution is presented in 
Section 3.1; this execution is used in Section 3.2 for deriving necessary conditions 
for the write operation under specific assumptions on the delays. We refer to any 
linearizable implementation A of read/write objects, including an object X with 
at least two writers pi and pj, and a distinct reader pk. 



3.1 A Non-Linearizable, if Admissible, Execution 

This construction is based on one in [2, Section 4] and [12, Section 5]. We start
with an admissible execution e, in which pi writes xi to X, then pj writes xj to X,
xj ≠ xi, and finally pk reads xj from X; moreover, we assume that all clocks in e
run at a rate of σ for some constant σ such that 1/ρ ≤ σ ≤ ρ. If pi's history is
shifted later, while pj's history is shifted earlier, each by an appropriate amount,
while both are either “stretched” or “shrunk” by a factor of σ, depending on
whether 1 < σ ≤ ρ or 1/ρ ≤ σ < 1, the result is an execution e', not necessarily
admissible, in which the write operation by pj precedes the write operation
by pi, which, in turn, precedes the read operation by pk. If, in addition, all
clocks are correspondingly “stretched” or “shrunk” by the same factor of σ,
all three processes still “see” the same events occurring at the same local time
and cannot, therefore, distinguish between e and e'; thus, in particular, pk still
reads xj from X, which implies that e', if admissible, is not linearizable. We now
present some details of the construction.

By the serial specification of X, there exists an admissible execution e of A
consisting of the following operations at processes pi, pj, and pk: pi performs a
write operation wopi on X with t_c^(e)(wopi) = 0 and val^(e)(wopi) = xi; pj performs
a write operation wopj on X with t_c^(e)(wopj) = |wopi| and val^(e)(wopj) = xj;
pk performs a read operation ropk on X with t_c^(e)(ropk) = |wopi| +
max{|wopi|, |wopj|}; apparently, max{|wopi|, |wopj|} = |W_A(X, e)|, so that
t_c^(e)(ropk) = |wopi| + |W_A(X, e)|. Moreover, assume that γi^(e)(t) = γj^(e)(t) =
γk^(e)(t) = σt for some positive constant σ such that 1/ρ ≤ σ ≤ ρ, so that all
clocks are ρ-drifting. (We omit reference to clocks of other processes in this
extended abstract.)

Since A is a linearizable implementation and e is an admissible execution,
e is a linearizable execution. Thus, there exists a legal linearization τ of e such
that for each MCS process p, ops(e) | p = τ | p. We use the construction of e to
show simple properties of the sequence τ, namely that wopi →τ wopj, and that
wopj →τ ropk. Since τ is a legal operation sequence, these properties imply that
val^(e)(ropk) = val^(e)(wopj) = xj.

We now “perturb” the (admissible) execution e in order to obtain another ex- 
ecution e', which is not necessarily admissible; however, we shall show that if e' is 
admissible, then it is not linearizable. We construct e' as follows. 
(1) Set γi^(e')(t) = t/σ − σ|W_A(X, e)|, γj^(e')(t) = t/σ + σ|W_A(X, e)|, and
γk^(e')(t) = t/σ. (2) For any process pi with clock γi^(e'), define a mapping h'i from ℝ to
finite sequences of computation steps by pi as follows. Each step at pi associated
with real time t in hi is associated with real time (γi^(e'))⁻¹(γi^(e)(t)) in h'i. In addition,
h'i preserves the ordering of steps in hi. (3) e' preserves the correspondence
between message-delivery and message-send events in e.

Since e is an execution of A, for each MCS process pi, hi is a history for pi
with clock γi^(e). By rule (2), this implies that h'i is a history for pi with clock
γi^(e'); moreover, for any real times t1, t2 ∈ ℝ with t1 < t2, γi^(e')(t2) − γi^(e')(t1) =
(t2 − t1)/σ. Since 1/ρ ≤ σ ≤ ρ, 1/ρ ≤ 1/σ ≤ ρ, so that γi^(e') is ρ-drifting; thus, by rule (3),
it follows that e' is an execution of A. In addition, rule (2) immediately implies
that executions e and e' are equivalent. We continue to establish a fundamental
property of the execution e'.
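
The retiming step can be illustrated numerically. The sketch below assumes, for illustration, that every clock in e is γ(t) = σt and that the clock of a process in e' has the form γ'(t) = t/σ + shift, where shift is a per-process constant; the function names are hypothetical. It shows that a step keeps its local clock time while its real time, and hence message delays, change.

```python
import math


def old_clock(t, sigma):
    """Clock in e: rate sigma."""
    return sigma * t


def new_clock(t, sigma, shift):
    """Assumed clock in e': rate 1/sigma plus a per-process shift."""
    return t / sigma + shift


def retime(t, sigma, shift):
    """Real time in e' of a step occurring at real time t in e, chosen so
    that new_clock(retime(t)) equals old_clock(t)."""
    return sigma * (sigma * t - shift)


def retimed_delay(send_t, recv_t, sigma, shift_sender, shift_receiver):
    """Delay in e' of a message sent at send_t and delivered at recv_t in e."""
    return retime(recv_t, sigma, shift_receiver) - retime(send_t, sigma, shift_sender)


sigma, shift = 2.0, -3.0
t = 5.0
# The local clock reading of the step is unchanged by retiming.
print(math.isclose(new_clock(retime(t, sigma, shift), sigma, shift),
                   old_clock(t, sigma)))
# Delay sigma^2 * (recv - send) - sigma * (shift_receiver - shift_sender).
print(retimed_delay(1.0, 4.0, sigma, shift_sender=-3.0, shift_receiver=3.0))
```

Under these assumptions, real-time intervals are scaled by σ², and a delay of r − s in e becomes σ²(r − s) − σ(shift_receiver − shift_sender) in e'; whether e' is admissible then depends on the specific delay model, as developed in Section 3.2.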

Lemma 1. Assume that e' is an admissible execution. Then, e' is not linearizable.

Proof. We give a sketch of the proof. Since A is a linearizable implementation
and e' is an admissible execution of A, e' is a linearizable execution. Thus,
there exists a legal linearization τ' of e' such that for each MCS process p,
ops(e') | p = τ' | p. We show simple properties of the sequence τ', namely that
wopj →τ' wopi and wopi →τ' ropk. Since τ' is a legal operation sequence, these
properties imply that val^(e')(ropk) = val^(e')(wopi) = xi. Since xi ≠ xj, it follows
that val^(e')(ropk) ≠ val^(e)(ropk). However, the equivalence of e and e' implies
that val^(e')(ropk) = val^(e)(ropk). A contradiction.



3.2 Results for Specific Models of Delays 

Our methodology is as follows. We first calculate message delays in execution e' 
(independent of specific delay assumptions). Next, we consider separately each 
specific assumption on delays; requiring that message delays in the execution e' 
constructed in Section 3.1 satisfy the assumption yields the admissibility of e', 
which, by Lemma 1, implies the non-linearizability of e'. For the model with
lower and upper bounds on delays, we show: 

Theorem 1. Consider the model with lower and upper bounds on the delays.
Let A be any linearizable implementation of read/write objects, including an
object X with at least two writers pi and pj, and a distinct reader pk. Fix any
parameters δij, δji, δik, δki, δjk, δkj > 0. Then, for any parameter σ ∈ [1/ρ, ρ],
there exists an admissible execution e of A with δij ∈ Δij^(e), δji ∈ Δji^(e), δik ∈
Δik^(e), δki ∈ Δki^(e), δjk ∈ Δjk^(e), δkj ∈ Δkj^(e), such that either

|W^(X,e)| 

Sij d d — u Sji d d — u d — u d 

< max{— - - —,Sik - - Ski, - 5jk,Skj ~ , 
or



|W^(X,e)| 

. rSij d — u d ^ji c d — u d d d — u 



Proof. We give a sketch of the proof. Assume, by way of contradiction, that there
exists a linearizable implementation A of read/write objects, including the object
X, such that for any parameter σ ∈ [1/ρ, ρ], for every admissible execution e
of A with δij ∈ Δij^(e), δji ∈ Δji^(e), δik ∈ Δik^(e), δki ∈ Δki^(e), δjk ∈ Δjk^(e), δkj ∈ Δkj^(e),
neither inequality holds. We establish that the execution e' constructed in Section
3.1 is an admissible execution of A; appealing to Lemma 1, this implies
that e' is non-linearizable, which contradicts the fact that A is a linearizable
implementation. (To prove that e' is an admissible execution, we show by case
analysis that for any pair of processes pl and pm, and for any message m received
by pm from pl, d^(e')(m) ∈ [d − u, d].)



Theorem 1 establishes the existence of executions with “gaps” for the re- 
sponse times of write operations. For the model with a bound on the round-trip 
delay bias, we show: 



Theorem 2. Consider the model with a bound on the round-trip delay bias. Let
A be any linearizable implementation of read/write objects, including an object X
with at least two writers pi and pj, and a distinct reader pk. Fix any parameters
δij, δji, δik, δki, δjk, δkj > 0. Then, for any parameter σ ∈ [1/ρ, ρ], there exists
an admissible execution e of A with δij ∈ Δij^(e), δji ∈ Δji^(e), δik ∈ Δik^(e), δki ∈
Δki^(e), δjk ∈ Δjk^(e), δkj ∈ Δkj^(e), such that either



|W^(X,e)| < +max{^^^y^ 



S Skj Sj k 
4^’ 2 



£ 

4^’ 




or 



|W^(X,e)| > ^ + 



+ 



4(j2’' 



'^kj 



Ojk 



4(j2 ’ 4 ^ 



The proof of Theorem 2 is similar to the proof of Theorem 1, and it is omitted.
Theorem 2 demonstrates the existence of executions with “gaps” on the response
times of write operations. In order to derive a worst-case lower bound on the
response time for write operations from Theorem 2, we set σ = 1/ρ, δij − δji = ε,
Sik— Ski = ^5 and Skj —Sjk = With these choices, the upper limit on |W^(X, e)| 
becomes negative, and, therefore, it cannot be met, which implies that the lower 
limit on |W^(X, e)| must be met, which is positive for these choices. We obtain: 



Corollary 1. Consider the model with a bound on the round-trip delay bias.
Let A be any linearizable implementation of read/write objects, including an
object X with at least two writers pi and pj, and a distinct reader pk. Then,

|W^(X)| >A/4 + V4. 
For the model of broadcast networks, we show:



Theorem 3. Consider the model of broadcast networks. Let A be any linearizable
implementation of read/write objects, including an object X with at least
two writers pi and pj, and a distinct reader pk. Fix any parameters δij, δji,
δik, δki, δjk, δkj > 0. Then, there exists an admissible execution e of A with
δij ∈ Δij^(e), δji ∈ Δji^(e), δik ∈ Δik^(e), δki ∈ Δki^(e), δjk ∈ Δjk^(e), δkj ∈ Δkj^(e),
such that either



\WAX, e)\<-X+ max{5„ - Sik - X 5 .^ _ ^ 



or 






|Wx(^, e)| > + inin{(5y - Sik + Sjk - Sji + ^^3 ’ 2 



-}• 



The proof of Theorem 3 is similar to the proof of Theorem 1, and it is
omitted. Theorem 3 demonstrates the existence of executions with “gaps” on the
response times of write operations. In order to derive a worst-case lower bound
on the response time for write operations from Theorem 3, we set σ = 1/ρ,
δij − δik = δjk − δji = β, and δkj − δki = β. With these choices, the upper
limit on |W_A(X, e)| becomes negative, and, therefore, it cannot be met, which
implies that the lower limit on |W_A(X, e)| must be met, which is positive for
these choices. We obtain:



Corollary 2. Consider the model of broadcast networks. Let A be any linearizable 
implementation of read/write objects, including an object X with at least 
two writers p_i and p_j, and a distinct reader p_k. Then, |W^(X)| > P j2^ P j2. 



4 Reads 

A construction of a non-linearizable, if admissible, execution is presented in 
Section 4.1; this execution is used in Section 4.2 for deriving necessary conditions 
for the read operation under specific assumptions on the delays. We refer to any 
linearizable implementation A of read/write objects including an object X with 
at least two readers pi and pj, and a distinct writer pk. 



4.1 A Non-Linearizable, if Admissible, Execution 

This construction is based on one in [2, Section 4] and [12, Section 5]. We start 
with an admissible execution e, in which p_i reads ⊥ from X, then p_j and p_i 
alternate reading from X while p_k is writing x to X, and finally p_j reads x 
from X; moreover, we assume that all clocks in e run at a rate of σ, for some 
constant σ such that 1/ρ < σ < ρ. Thus, there exists a read operation rop_0, say 
by p_i, that returns ⊥ and is immediately followed by a read operation rop_1 by p_j 
that returns x. If p_i's history is shifted later by |R_A(X, e)|, while p_j's history 
is shifted earlier by |R_A(X, e)|, while both are either “swelled” or “shrunk” 
by a factor of σ, the result is an execution e' in which rop_1 precedes rop_0. If, 
in addition, all clocks are correspondingly “swelled” or “shrunk” by the same 
factor σ, all three processes still “see” the same events occurring at the same local 
time and cannot, therefore, distinguish between e and e'; thus, in particular, p_j 
and p_i still read x and ⊥ in their read operations rop_1 and rop_0, respectively, in 
this order. This implies that e', if admissible, is non-linearizable. We now present 
some details of the construction. 

Let b = [|W^(X)|/2|R^(X)|] . By the serial specification of X, there exists 
an admissible execution e of ^ consisting of the following operations at pro- 
cesses Pi, Pj, and pk- For each integer /, 0 < / < 6 , pi performs a read operation 
ropp^^ on X; for each integer /, 0 < / < 6 , pj performs a read operation 
on X; pk performs a write operation wopi^ on X with val^^\wopj^) = x. For 
each /, 0 < I < 26+1, let rop^^^ = rop^P if I is even, or rop^ if I is odd. 
The definition of the call times of read operations in e is inductive. For the basis 
case, ti^\rop^^^) = 0 . Assume inductively that we have defined {rop^^^) where 
0 < / < 26+ 1. Then, = ti^\rop^^^) + \rop^^^\. Set also tc{wopf.) = 

|rop^^^|. Moreover, assume that "yp\t) = some con- 
stant cr such that 1 /p < cr < p, so that all clocks , and are p-drifting. 

(We omit reference to clocks of other processes in this extended abstract.) 

Since M is a linearizable implementation and e is an admissible execution of 
M, e is a linearizable execution. Thus, there exists a legal linearization r of e such 
that for each MCS process p/, ops{e) \ I = r \ 1. We use the construction of e to 
show simple properties of the operation sequence r, namely that ropf^^ — '^op^ 
and that wop^ rop^p^~^^\ We show that for each /, 0 < / < 26, rop^^^ — 
^^p(+i) These properties imply that there exists an index /o, 0 < /q < 26, such 
that rop^^^^ — wopj^ — rop^^°+^\ Since r is a legal operation sequence, this 
implies that val^^\rop^^^^) = + and = x. Assume, without loss 

of generality, that /q is even, so that rop^^^^ is a read operation by process p^. 

We now “perturb” the (admissible) execution e in order to obtain another 
execution e' which is not necessarily admissible; however, we shall show that 
if e' is admissible, then it is not linearizable. We construct e' as follows. ( 1 ) Set 

if \t) = - o-|R-^(Ve)|, jf \t) =t/a + cr|R^(X,e)|, and 7 ^® \t) = tja. 

(2) e' preserves the correspondence between message-delivery and message-send 
events in e. (3) For any process pi, each step at pi occurring at real time t in e is 
scheduled to occur at real time addition, e' preserves the 

ordering of steps in e. Since e is an execution of M, for each MCS process p/, hi 
is a history for pi with clock By rule (2), this implies that h[ is a history 
for Pi with clock ^ moreover, for any real times ti,t 2 G 5R with ti < ^ 2 , 

\t 2 ) — Ip ^(^i) = (^2 — Since 1 /p < cr < p, 1/p < 1 /cr < p, so that 

^ is p-drifting; thus, by rule (3), it follows that e' is an execution of A. In 
addition, rule (3) immediately implies that executions e and e' are equivalent. 
We continue to show a fundamental property of the execution e'. 
Lemma 2. Assume that e' is an admissible execution. Then, e' is not linearizable. 



4.2 Results for Specific Models of Delays 

We consider separately each specific assumption on message delays; requiring 
that message delays in the execution e' constructed in Section 4.1 satisfy the 
assumption yields the admissibility of e', which, by Lemma 2, implies that e' is 
not linearizable. For the model with lower and upper bounds on the delays, we 
show: 



Theorem 4. Consider the model with lower and upper bounds on the delays. Let 
A be any linearizable implementation of read/write objects, including an object X 
with at least two readers p_i and p_j, and a distinct writer p_k. Fix any parameters 
δ_ij, δ_ji, δ_ik, δ_ki, δ_jk, δ_kj > 0. Then, there exists an admissible execution e of 



(e) 



A with Sij e 
that either 



, 6 ji G A^jf , Sik G 



A^f} , Ski 



^ki ’ 



e /ijfe , Skj e A 



(e) 

kj ^ 



sueh 



|R^(^,e)| 

< max{^ — 



d d — u 

V’ V 



or 



Sji^ 

2 



d d — u 

^ik 9 5 9 



d — u d 

-^-SjkAkj - ^ 1 , 



. r Sij d — u d Sji d — u d d d — u 

> mm| — , -j—j — , Sik 5 , — y — Ski ^ ~2 ~ Skj 2 I ‘ 

2 2p^ 2p^ 2 p^ p^ p^ p^ 

For the model with a bound on the round-trip delay bias, we show: 



Theorem 5. Consider the model with a bound on the round-trip delay bias. Let 
A be any linearizable implementation of read/write objects, including an object X 
with at least two readers p_i and p_j, and a distinct writer p_k. Fix any parameters 
δ_ij, δ_ji, δ_ik, δ_ki, δ_jk, δ_kj > 0. Then, there exists an admissible execution e of 



(e) 



A with Sij G Af- 
that either 



Sji 



c A^-^ , Sik c 



a\^ , Ski 



c a[,} , Sik c A\jJ,Ski G A 



(e) 



^ki ’ t ^jk ’ ^kj 



V(e) 

kj ^ 



sueh 



|R^(X,e)| < -^ +max{ 



Sik Ski 
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V’ 



Skj - s 
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Sip__S^^ 

A J ’ 



or 



|R^(X,e)| < ^ +min{^^^ 
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V 



Skj Sj k 



^ ^ij ^ji A 
Theorem 5 demonstrates the existence of executions with “gaps” on the 
response times of read operations. In order to derive a worst-case lower bound 
on the response time for read operations from Theorem 5, we set σ = 1/ρ, 
Sij — Sji = e, Sik — Ski = ^5 and Skj — Sjk = With these choices, the upper limit 
on |R^(X, e)| becomes negative, and, therefore, it cannot be met, which implies 
that the lower limit on |R^(X, e)|, which is positive for these choices, must be 
met. We obtain: 
Corollary 3. Consider the model with a bound on the round-trip delay bias. 
Let A be any linearizable implementation of read/write objects, including an 
object X with at least two readers p_i and p_j, and a distinct writer p_k. Then, 
|R^(X)| >pV4 + e/4. 

For the model of broadcast networks, we show: 



Theorem 6. Consider the model of broadcast networks. Let A be any linearizable 
implementation of read/write objects, including an object X with at least 
two readers p_i and p_j, and a distinct writer p_k. Fix any parameters δ_ij, δ_ji, 
^ik ; ^ki ; ^jk, 

either 



dkj 
i e 



> 0. Then, there exists an admissible exeeution e of A with 
^ A\l\Ski e e Aj!,5kj e such that 






\RAX, e)l < -^ + max{5„ - 6^^ + Xj., _ < 5 .. + , 



or 



e)| > -X -I- max{(5y - Sik + ~ _ 

Theorem 6 demonstrates the existence of executions with “gaps” on the 
response times of read operations. In order to derive a worst-case lower bound 
on the response time for read operations from Theorem 6, we set σ = 1/ρ, 
^ij — ^ik = Pi ^jk — ^ji = Pi and 5kj — Ski = P- With these choices, the upper 
limit on |R^(X, e)| becomes negative, and, therefore, it cannot be met, which 
implies that the lower limit on |R^(X, e)| which is positive for these choices, 
must be met. We obtain: 

Corollary 4. Consider the model of broadcast networks. Let A be any linearizable 
implementation of read/write objects, including an object X with at least 
two readers p_i and p_j, and a distinct writer p_k. Then, |R^(X)| > p^Pl2-\- pj2. 
Abstract. Many crucial network tasks such as database maintenance 
can be efficiently carried out given a tree that spans the network. By 
maintaining such a spanning tree, rather than constructing it “from 
scratch” after every topology change, one can improve the efficiency 
of the tree construction, as well as the efficiency of the protocols that 
use the tree. We present a protocol for this task whose communication 
complexity is linear in the “actual” size of the biggest connected 
component. The time complexity of our protocol has only a polylogarithmic 
overhead in the “actual” size of the biggest connected component. 
The communication complexity of the previous solution, which was 
considered communication optimal, was linear in the network size, that is, 
unbounded as a function of the “actual” size of the biggest connected 
component. The overhead in the time measure of the previous solution 
was polynomial in the network size. 

In an asynchronous network it may not be clear what is the meaning 
of the “actual” size of the connected component at a given time. To 
capture this notion we define the virtual component and show that in 
asynchronous networks, in a sense, the notion of the virtual component 
is the closest one can get to the notion of the “actual” component. 



1 Introduction 

1.1 Motivation and Existing Solutions 

Maintaining a common database in the local memory of each node is a common 
technique in computing a distributed function. An important example is the 
case in which the replicated information is the set of non-faulty links adjacent 
to each node. Maintaining replicas of this information is the classical “Topology 
Update” problem, where each node is required to “know” the description of its 
connected component of the network [Vis83,BGJ+85,MRR80,SG89,CGKK95,ACK90]. 
This is one of the most common tasks performed in existing networks since, 
once the topology is known to all nodes, many distributed tasks can be 
reduced to sequential tasks. This reduction is conducted by having each node 
simulate the distributed task on the topology known to it. An example of a 
protocol that uses this approach is the Internet OSPF interior routing protocol 
[OSPF]. 

In [ACK90] it was shown that the incremental cost of adapting to a single 
topology change can be smaller than the communication complexity of the 
previous approach [AAG87] of solving the problem “from scratch”. (A variant of 
the algorithm of [ACK90] was implemented later as a part of the PARIS networking 
project at IBM [CGKK95].) 

In this paper we improve the first subtask of [ACK90]: maintaining a spanning 
tree in the dynamic network, thus also improving the database maintenance 
algorithm. Our algorithm also improves any other algorithm that uses the 
spanning tree maintenance algorithm as a building block. The amortized 
communication complexity of the tree maintenance subtask in [ACK90] was O(V), 
where V was the number of nodes in the network. In a dynamic network the 
value of this parameter might be much larger than the “actual” size of the biggest 
connected component. 

An example of such a scenario is when all the nodes in the network are 
separated from each other and only one edge recovers to create a two-node 
connected component. In this case, the size of the network is V/2 times bigger 
than the “actual” size of the connected component. Moreover, under this scenario, 
according to the algorithm in [ACK90], the two nodes of the connected component 
exchange O(V) messages. Therefore, this scenario demonstrates that the 
amortized communication complexity in [ACK90] is not bounded as a function 
of the size of the biggest connected component in the network. 

The quiescence time of the algorithm of [ACK90] was high: O(V²), mainly 
because the algorithm merged only two trees at a time in each of its phases. 
Merging more trees in the same iteration of the algorithm of [ACK90] could 
have violated an important invariant (called the loop freedom invariant) that the 
correctness of the algorithm relied upon. This was, probably, the reason only two 
trees were merged at a time in [ACK90]. 



1.2 Our Solution 

In this paper we provide a tree maintenance protocol with amortized communication 
complexity that is linear in the “actual” size A of the connected component 
in which the algorithm is performed. (The message complexity of the previous 
solution was not bounded as a function of A.) The time complexity overhead of 
our protocol is only polylogarithmic in the “actual” size of the biggest connected 
component. (The time overhead of the previous solution was polynomial in the 
network size.) 

In an asynchronous network it may not be clear what is the meaning of the 
“actual” size of the connected component at a given time. To capture this notion 
we define the virtual component and show that in asynchronous networks, in a 
sense, the notion of the virtual component is the closest one can get to the notion 
of the “actual” component. 

Topological changes may break a tree into several rooted trees. The execution 
of our algorithm is composed of iterations that are invoked continuously by every 
root of a tree, until the connected component is spanned. Each iteration of the 
algorithm consists of two phases: in the first phase, referred to as the 
Trees Approval phase, a set of trees is “chosen” and “prepared” for merging. 
In the second phase, referred to as the Trees Merge phase, the actual merge is 
executed by parallel invocations of the TREE MERGE procedure for every tree 
that was “chosen” in the Trees Approval phase. Some topological changes that 
occur during the algorithm execution may cause the algorithm to initiate the first 
phase, while other topological changes may not influence the algorithm execution. 
We show that this two-phase structure enables the algorithm to maintain the loop 
freedom invariant of [ACK90] even though our algorithm does merge more than 
two trees at a time. (This relies, in particular, on the fact that a non-“chosen” 
tree does not participate in the second phase even if an edge from it to a node 
in a chosen tree has recovered.) In that way the algorithm removes the time 
bottleneck of the algorithm of [ACK90]. 

The communication improvement is achieved by using a new message exchange 
policy: exchanging fewer messages between every two merging trees. Our 
algorithm exchanges only messages that describe the “approved” trees and not 
the whole local memory of every node as in the algorithm of [ACK90]. We show 
that sending fewer messages while trees merge can, in fact, cause additional 
messages to be sent at later times during the run of the algorithm. (This, probably, 
is the reason the algorithm of [ACK90] did not try to economize on messages 
between trees at the time of merging.) However, we manage to show that the 
number of these additional messages is small, leading to the improvement in the 
total number of messages. 

Another difference from the work of [ACK90] is an adaptation we had to 
make to one of the building blocks used in the algorithm of [ACK90]. This is 
needed to cope with our algorithm’s new message exchange policy. 

An additional change that had to be made in our algorithm is the addition 
of a new message, range deletion message, that instructs its receiver to update 
more than one item in its local database. We show that the use of that new 
message enables the algorithm to keep its time complexity as a function of the 
size of the “actual” connected component rather than a function of the number 
of topological changes that occurred. 

The rest of the paper is organized as follows: Section 2 describes the model 
and the problem. Section 3 presents a high level overview of the algorithm of 
[ACK90] and the algorithm of [AS91] (which is used as a building block in 
our algorithm). Section 4 presents Phase 1 of our algorithm (Trees Approval) 
and Phase 2 (Trees Update). Section 5 contains the correctness and complexity 
analysis. 
2 The Problem 

2.1 The Model 

The network is represented by a graph G = (V, E) where V is the set of nodes, 
each having a distinct identity, and E ⊆ V × V is the set of edges, or links. We 
assume that each edge has a distinct finite weight. (If the edge weights are not 
distinct, one simply appends to the edge weight the identities of the two nodes 
joined by the edge, listing, say, the lower ordered node first [GHS83].) 
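
The tie-breaking rule above can be sketched as follows. This is only an illustration under our own assumptions: the function name `distinct_weight` and the tuple encoding are hypothetical and do not come from [GHS83] or from the algorithm described here.

```python
def distinct_weight(weight, u, v):
    """Return a totally ordered key for the edge (u, v).

    Hypothetical helper: the raw weight is extended with the identities
    of the two endpoints, lower-ordered node first, so that two edges
    with equal raw weights still compare as distinct.
    """
    lo, hi = (u, v) if u <= v else (v, u)
    return (weight, lo, hi)


# Three edges, two of which share the raw weight 5.
edges = [(5, "b", "a"), (5, "a", "c"), (3, "c", "b")]
keys = sorted(distinct_weight(w, u, v) for w, u, v in edges)
```

Because node identities are distinct, two different edges can never produce the same key, and the key does not depend on the direction in which the edge is listed.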

The network is dynamic: edges may fail or recover. Whenever an edge fails, an 
underlying lower-layer link protocol notifies both endpoints of this edge about 
the failure, before the edge can recover [MRR80,BGJ+85,AGG+90]. Similarly, 
each endpoint of an edge is notified of the edge’s recovery. A message can be 
received only over a non-faulty edge. 

Messages transmitted over a non-faulty edge eventually arrive, or the edge 
will fail (at both endpoints). Messages arrive over any given edge according to 
the FIFO (First In First Out) discipline. The communication is asynchronous. 

The complexity measures that are used to evaluate the algorithm performance 
are: (1) Amortized Communication - the total number of messages (each 
containing O(log V) bits) sent by the protocol, divided by the number of topology 
changes that occurred during the execution. (2) Quiescence Time - the maximum 
normalized time from the last input change until termination of the protocol. 
Normalized time is evaluated under the assumption [Awe88] that link delays 
vary between 0 and 1. This assumption is used only for the purpose of evaluating 
the performance of the algorithm, but is not used to prove its correctness. 



2.2 Problem Definition 

In response to topological changes, namely, recoveries or failures of edges, the 
algorithm is required to mark, at each node, a subset of the node’s edges. The 
collection of edges marked by all of the network nodes is called, as in the 
algorithm of [ACK90], the real forest. The requirement imposed on the algorithm 
is that the real forest is, in fact, a forest at all times. Trees in this forest are 
called real trees. If the input stops changing then the output (the real forest) is 
required to eventually become a spanning tree of the connected component. 

3 Background 

3.1 The Tree Maintenance Algorithm of [ACK90] 

A failure of a real forest edge causes the edge to become unmarked and 
disconnects a real tree into two or more real trees. Our model allows multiple 
edges to fail simultaneously. 

In order to span the entire connected component, the algorithm of [ACK90] 
uses the following scheme to reconnect the tree, until it spans the connected 
component. Every real tree locates its minimum outgoing edge: the edge with 
the minimum weight among all the edges leading from (nodes of) this real tree 
to (nodes of) other real trees. If this edge is also the minimum outgoing 
edge of the other real tree, then it is guaranteed that the endpoints will agree. 
(Unless, of course, additional topological changes occur, e.g. failure of the 
minimum edge.) Therefore, at each algorithm iteration two distinct real trees are 
merged into a larger real tree, by marking the minimum edge that connects 
them. The high-level description of the algorithm of [ACK90] appears in Figure 1. 
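
The merge condition can be illustrated with a small centralized sketch. It assumes, purely for illustration, that each real tree is given as a set of node identities and that the non-faulty edges are given as `(weight, u, v)` triples with distinct weights; the names `min_outgoing_edge` and `can_merge` are ours, and the distributed FIND subroutine of [ACK90] is not modeled.

```python
def min_outgoing_edge(tree_nodes, edges):
    """Return the lightest edge with exactly one endpoint in tree_nodes.

    edges is an iterable of (weight, u, v) triples with distinct weights.
    Returns None when no edge leaves the tree.
    """
    best = None
    for w, u, v in edges:
        if (u in tree_nodes) != (v in tree_nodes):  # edge leaves the tree
            if best is None or w < best[0]:
                best = (w, u, v)
    return best


def can_merge(tree_a, tree_b, edges):
    """Two disjoint trees may merge when they chose the same outgoing edge."""
    edge_a = min_outgoing_edge(tree_a, edges)
    edge_b = min_outgoing_edge(tree_b, edges)
    return edge_a is not None and edge_a == edge_b


# Three real trees {1, 2}, {3}, {4} and three non-faulty edges.
edges = [(1, 2, 3), (2, 3, 4), (5, 1, 4)]
```

In this example the trees {1, 2} and {3} both choose the edge of weight 1 and may merge, while {4} chooses the edge of weight 2, which {3} did not choose.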

Two main distributed subroutines are used in the algorithm (for the full list 
of the properties of the two subroutines refer to Section 5): (1) FIND - a subroutine 
that is used to find the minimum outgoing edge of the real tree using a dynamic 
data structure maintained by each node. (2) UPDATE - a subroutine that is used 
to update the mentioned dynamic data structure. The dynamic data structure 
at each node v is v’s approximation of the real forest and is called v’s forest 
replica. The approximation (a subset of v’s forest replica) of Node v’s real tree 
is called v’s tree replica. 

The UPDATE subroutine attempts to keep the tree replicas of all the nodes 
as “accurate” (i.e. close to the real forest) as possible. To this end a node that 
performs a change in the marking of its adjacent edges (unmarking as a result 
of a failure, or marking an edge that the algorithm decided to use to merge two 
trees) updates its forest replica, and communicates the change over the marked 
edges to its whole real tree. Observe that in the case that the forest replicas of the 
endpoints of an edge disagree, it is neither obvious how such an “agreement” is 
to be reached, nor which one of the endpoints is “more correct”¹ (better reflects 
the real forest). In order to resolve this conflict, the UPDATE subroutine is 
based on an idea that was called the Tree Belief principle in [ACK90]. This 
principle is used between every two neighbors, in order to decide which of their 
forest replicas is “more correct” regarding every Edge e, in other words, which 
replica reflects a later (in history) status of Edge e. The two neighboring nodes 
apply the Tree Belief principle for every topological item that appears in their 
replicas, namely, for every edge (x, y). Consider a graph, U, which is the union of 
the tree replicas of nodes u and v. According to this principle, Node v is “more 
correct” about an edge (x, y) than its neighbor u if in U the undirected path 
from u to edge (x, y) starts with the edge (u, v). (The intuition is that if Node v 
is not correct, an indicator for the error should have arrived, sooner or later, 
over the mentioned path.) Clearly, if for an edge (x, y) there are two distinct 
undirected paths from u, one starting with edge (u, v) while the other does not, 
this principle cannot be applied. Therefore, the following invariant is 
enforced by the algorithm in order to realize this principle. 

Definition 1. The Loop Freedom invariant ([ACK90]): For every real tree T, 
the union of the tree replicas of the nodes of T does not contain a cycle. 
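
As a rough illustration of how the Tree Belief principle relies on this invariant, the sketch below decides whether v is “more correct” than u about an edge. Everything in it is an assumption made for illustration: replicas are modeled as sets of undirected edges given as pairs, the names `first_hop` and `v_is_more_correct` are hypothetical, and the union U is assumed to be cycle-free so that the path from u to the edge is unique. It is not the distributed procedure of [ACK90].

```python
from collections import deque


def first_hop(union_edges, u, target_edge):
    """Return the neighbor of u on the path in U from u to target_edge.

    union_edges: iterable of undirected edges (a, b), assumed acyclic.
    Returns None if u is itself an endpoint of target_edge or if the
    edge is unreachable from u.
    """
    adjacency = {}
    for a, b in union_edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    x, y = target_edge
    parent = {u: None}
    queue = deque([u])
    while queue:
        node = queue.popleft()
        if node in (x, y):
            # Walk back along the BFS tree until we stand next to u.
            while parent[node] is not None and parent[node] != u:
                node = parent[node]
            return node if parent[node] == u else None
        for nxt in adjacency.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def v_is_more_correct(replica_u, replica_v, u, v, edge):
    """Tree Belief test: does the path from u to edge start with (u, v)?"""
    union = set(replica_u) | set(replica_v)
    return first_hop(union, u, edge) == v


replica_u = {("u", "v"), ("u", "w")}
replica_v = {("u", "v"), ("v", "x"), ("x", "y")}
```

Here v is taken to be “more correct” about (x, y) and (v, x), whose paths from u start with (u, v), but not about (u, w), which is adjacent to u itself.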

¹ This is due to the asynchronous nature of the network (we do not assume a 
real-time clock) and the bounded size of the messages (we cannot afford to number 
all the messages using a counter that counts to infinity). 
The UPDATE subroutine conducts the correction of the replicas using the 
Incremental Update technique. 

Definition 2. Incremental Update technique: The node with the correct data 
structure sends each neighbor node one message per error that appears in its 
neighbor's data structure. The message describes the place of the error in the 
data structure (thus, the neighbor can correct the error). 

This technique is based on the Neighbor-Knowledge assumption, namely that 
each node “knows” the content of the local memory of its network neighbors. 
This assumption holds for the algorithm of [ACK90] because the algorithm also 
maintains in each node’s local memory an estimate of the forest replicas of its 
neighbors. Let Node u’s mirror of Node v (denoted by Mirror_u(v)) be the data 
structure, at Node u, that represents the estimate that Node u has for the replica 
of u’s neighbor v. 



Whenever a marked edge fails 
    Unmark the edge (* at the endpoints *) 
Whenever two trees merge or a topological change occurs 
    Call UPDATE (* correct tree replicas *) 
    Call FIND (* choose min outgoing edge *) 
Whenever two trees choose same min outgoing edge 
    For each of the trees separately call UPDATE 
    (* After UPDATE terminates: *) 
    Mark the chosen edge at both of its endpoints (* merge *) 



Fig. 1. The main Algorithm of [ACK90]. 



3.2 The Maintenance-of-Common Data Algorithm of [AS91] 

In the TREE MERGE procedure, which is invoked in our algorithm’s second 
phase (described in Section 4.2), a new version of the algorithm of [AS91] is used 
as a distributed procedure. The purpose of TREE MERGE is to update the 
different replicas at all the nodes of all the approved trees with the topological 
changes. In the algorithm of [ACK90] this task was conducted, as mentioned 
earlier, using the UPDATE subroutine. By using a somewhat modified version of 
the techniques of [AS91], Procedure TREE MERGE improves the time 
complexity of our algorithm. 

In this section we describe the intuition behind the original version of the 
algorithm of [AS91] (for full details see [AS91]). The model in [AS91] is a 
communication network of n + 1 nodes arranged in a chain, each holding an m-bit 
local input array. The first node in the chain (w.l.o.g. the left node in the chain) 
is termed the broadcaster. The task of the algorithm is to write in the local 
memories of all the network nodes the value of the input of the broadcaster. The 
broadcaster’s array is considered to be the “correct” array and the broadcaster 
is in charge of “correcting” the other “wrong” arrays. We use the term thread to 
denote a single invocation of the algorithm that runs on a single network node at 
a time (but can migrate). The distributed algorithm is built from such threads 
which can migrate from one node to the node’s neighbor in a direction away 
from the broadcaster, namely, from a node to its right-hand neighbor. 

The algorithm places one thread, the first invocation of the algorithm, in 
charge of the entire protocol - that is, in charge of correcting all the local arrays 
using an Incremental Update technique (see Definition 2). Namely, this thread 
starts correcting every error it “knows” about in its neighbor. Only one message 
per error is sent. Once its neighbor’s array is correct the thread moves on to that 
neighbor and continues its operation. However, this technique can be poor in 
terms of time complexity due to the fact that the Incremental Update technique 
puts severe restrictions on the use of pipelining. (Intuitively, only when a Node v 
“knows” that its neighbor had the opportunity to correct an item but declined, does 
v “know” that the item is correct; only then can v correct the next neighbor 
on the chain.) To improve the time complexity of the Incremental Update, the 
algorithm of [AS91] adopts a (very weak) version of message pipelining: each 
invocation of the algorithm works recursively and creates two “child” threads. 
Each of the child threads is in charge of correcting half of its parent’s array 
in all the network nodes. The first child thread is in charge of correcting the 
lower half of its parent’s array and the second child is in charge of correcting the 
upper half of its parent’s array. The second child’s correction messages are sent 
only after the first child has finished sending its correction messages, thereby 
keeping the increasing bit order of the correction according to the Incremental 
Update technique. After a thread finishes correcting the array it is in charge of, 
it moves on to its neighbor. Therefore, the two threads move along the network 
nodes relative to one another in their work. The parent thread itself tags along 
just behind its children. In order to correct its smaller replica, a thread runs 
the same protocol as its parent thread. In particular, a child thread can create 
children that are in charge of arrays that are still smaller than its own, and so 
forth. Splitting the correction of an array in that manner enables the two threads 
to work in parallel on different nodes and therefore improves the algorithm’s time 
complexity. 
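
The recursive halving can be illustrated with a deliberately simplified, sequential sketch. It models only one broadcaster array and one neighbor array, and it ignores thread migration along the chain, parallelism, and the rule that limits when children are created; the function name `correct_range` and the message log are our own assumptions, not part of [AS91].

```python
def correct_range(broadcaster, replica, lo, hi, log):
    """Correct replica[lo:hi] against broadcaster[lo:hi].

    Each "thread" splits its range in two: the first child handles the
    lower half and the second child the upper half, strictly after the
    first. One correction message (logged as an index) is sent per error.
    """
    if hi - lo <= 1:
        if lo < hi and replica[lo] != broadcaster[lo]:
            replica[lo] = broadcaster[lo]
            log.append(lo)  # one message per error
        return
    mid = (lo + hi) // 2
    correct_range(broadcaster, replica, lo, mid, log)  # first child
    correct_range(broadcaster, replica, mid, hi, log)  # second child


broadcaster = [1, 0, 1, 1, 0, 0, 1, 0]
replica = [1, 1, 1, 0, 0, 0, 0, 0]
messages = []
correct_range(broadcaster, replica, 0, len(replica), messages)
```

Because the second child runs only after the first, the logged corrections appear in increasing index order, which is the ordering property the Incremental Update technique depends on.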

The way in which a thread creates a child thread is by sending a thread-carrier 
message to the right-hand node in the network. This message carries the 
indexes of the smaller array that the new child thread will be in charge of. Had 
each thread created child threads immediately, the message complexity of creating 
these threads could have been a dominant factor in the protocol’s message 
complexity. In such a case there could be arbitrarily more thread-carrier messages 
than error-correction messages. The algorithm of [AS91] avoids this problem by 
allowing a thread to create child threads only if it has corrected “enough” errors 
in its neighbor’s replica. In this manner the algorithm amortizes the communication 
costs associated with the creation of children over the communication cost 
of the correction messages. 

4 Our Algorithm 

4.1 Phase 1: Trees Approval 

As mentioned earlier, one of the bottlenecks in the time measure in the algorithm 
of [ACK90] is due to the fact that the algorithm merges only two trees at each 
iteration. A worst-case scenario for this algorithm is a network that is arranged 
in a chain, where in each iteration of the algorithm a tree consisting of a single 
node merges with the tree on the left side of the chain until, in V iterations, the 
whole network is spanned. In this scenario the algorithm’s quiescence time is the 
sum of an arithmetic sequence: O(V²). The first phase of our algorithm, the Trees 
Approval, removes this bottleneck by enabling more than two trees to merge at 
a time, while still avoiding any violation of the loop freedom invariant. In 
this section we describe this phase in detail. The high-level description of the 
first phase of the algorithm appears in Figure 2. For the sake of simplicity we 
describe our algorithm as a sequential algorithm (the distributed implementation 
is described in the full paper). 
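
To make the quadratic bound for the chain scenario concrete, suppose (as a simplifying assumption of ours, not a statement from [ACK90]) that the i-th iteration takes time proportional to the size i of the tree built so far, with some constant c. The total time over the V iterations is then:

```latex
\[
  \sum_{i=1}^{V} c \cdot i \;=\; c \cdot \frac{V(V+1)}{2} \;=\; O(V^{2}).
\]
```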

Whenever a tree chooses a minimum outgoing edge it sends over that edge 
a request message, namely, a request to merge, directed at the tree at the 
other endpoint of the edge. This is the same technique used in the algorithm of 
[ACK90] (described in Section 3.1). Our algorithm takes a different step 
whenever two trees have chosen to merge through the same outgoing 
edge, the core edge. Node r, the node with the higher identity between the core 
edge’s endpoints, starts a broadcast wave of a registration message². This message 
propagates over tree edges to all the nodes in r’s real tree. Let Nodes u and 
v be two neighbors that belong to different trees. When the registration message 
reaches u, which already got a request message from v, Node u marks the request 
as granted and propagates the registration message also to v through (u, v). A 
merging request that arrives at u after the registration message is received at u 
has to wait for the next registration message (namely, for the next algorithm 
iteration). The registration message is propagated transitively to all the real trees 
whose request messages arrived before the registration message (originated 
from r). We refer to these real trees as the approved trees. 
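
The set of approved trees can be pictured as a transitive closure, as in the following sequential sketch. It abstracts away all message timing: the mapping `granted_requests` is a hypothetical input that records, for each tree, which other trees had a merge request delivered to it before the registration message reached the receiving node. The names are ours and the sketch is not the distributed implementation.

```python
def approved_trees(core_trees, granted_requests):
    """Return the trees reached by the registration wave.

    core_trees: the two trees joined by the core edge.
    granted_requests: dict mapping a tree to the set of trees whose
        requests reached it before the registration message did.
    """
    approved = set(core_trees)
    frontier = list(core_trees)
    while frontier:
        tree = frontier.pop()
        for requester in granted_requests.get(tree, ()):
            if requester not in approved:
                approved.add(requester)      # request is granted
                frontier.append(requester)   # wave continues into it
    return approved


granted = {"A": {"C"}, "C": {"D"}, "E": {"F"}}
result = approved_trees({"A", "B"}, granted)
```

In this example C is approved through A and D through C, while E and F are untouched by the wave and must wait for a later iteration.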

² This registration message is substantially different from the initiate message of the 
algorithm of [GHS83]. In the algorithm of [GHS83], in certain cases, Node n that 
sends a connect message to Node n' may merge with n'’s tree even though n' has 
already received the initiate message. In our algorithm a tree always has to wait if 
its merging request has arrived after the registration message. This major difference 
arises from the fact that our model is dynamic. Our algorithm operates in the spirit 
of “two-phase commit” protocols. It first attempts to “lock” (“approve”) some of 
the trees and only after it stops “approving” does it start allowing them to merge. 
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When Node r is notified (by the standard Broadcast and Echo technique- the 
PIE algorithm [Seg83]) that the broadcast of the registration message has termi- 
nated, it broadcasts a start-update message to the roots of all the approved real 
trees to invoke the UPDATE’ a version of the UPDATE procedure of [ACK90]. 
This procedure is invoked separately by the root of each approved real tree T on 
receiving the start-update message. This invocation is responsible for updating 
T’s replica as it appears at T’s root at all of T’s nodes. 

The number of edge failures may be arbitrarily larger than the size of
a node's real tree. Therefore, to ensure that the number of messages
that the algorithm sends is a function of the size of the connected component, a
change has to be made in procedure UPDATE. We add a new update message,
range deletion, that carries one topological item (u,v). This message instructs
its receiver, Node w, to delete all the topological items that appear in w's tree
replica lexicographically after the topological item that appeared in the previous
correction message. If the range deletion message is the first message that the
node receives, the node deletes all the topological items that appear in its tree
replica lexicographically before (u,v).
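
A minimal Python sketch of how a receiver might apply a range deletion message follows. The function name apply_range_deletion and the representation of a replica as a sorted list of tuples are hypothetical. The sketch also assumes that the deleted range always ends just before the carried item (u,v); the text states this upper bound explicitly only for the case of a first message.

```python
from bisect import bisect_left, bisect_right


def apply_range_deletion(replica, item, previous=None):
    """Return the replica after applying one range deletion message.

    replica:  sorted list of topological items (tuples).
    item:     the topological item (u, v) carried by the message.
    previous: the item of the previous correction message, or None if
              this is the first message the node receives.
    Assumption: items strictly between `previous` and `item` are deleted.
    """
    low = 0 if previous is None else bisect_right(replica, previous)
    high = max(low, bisect_left(replica, item))
    return replica[:low] + replica[high:]


replica = [(1, 2), (1, 5), (2, 3), (3, 4), (4, 6)]
after_first = apply_range_deletion(replica, (2, 3))
after_later = apply_range_deletion(replica, (3, 4), previous=(1, 2))
```

A single message thus removes a whole run of stale items, so the number of messages does not have to grow with the number of individual edge failures.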

When Node r detects, using the termination detection algorithm of [DS80]
(the same technique is used in the algorithm of [ACK90]), that all the activations
of UPDATE have terminated, it invokes the second phase of the algorithm, Trees
Merge (described in Section 4.2).



Whenever a marked edge fails
    Unmark the edge (* at the endpoints *)
Whenever trees merge or a topological change occurs
    Call UPDATE' (* correct tree replicas *)
    Call FIND (* choose min outgoing edge *)
Whenever two trees choose the same min outgoing edge
    Broadcast & Echo registration message (* approving trees *)
    Invoke procedure UPDATE' separately for each of the approved trees
    Invoke Trees Merge (* second phase *)



Fig. 2. The First Phase - Trees Approval Algorithm. 



4.2 Phase 2: Trees Merge 

Node r invokes the merge of all the approved trees (the last line of Figure 2) by
broadcasting a start-TREE MERGE message that propagates to all the nodes
of the approved trees. Every local root u of an approved tree T that receives a
start-TREE MERGE message initiates an invocation of the TREE MERGE
procedure. We refer to Node u as the broadcaster of the procedure. Every TREE
MERGE procedure is an invocation of a new version of the algorithm of [AS91]
where the "correct" replica is the tree replica as it appears in the broadcaster's
local memory. The broadcaster is responsible for updating all the local replicas
in all the nodes of all the approved trees. Note that there are parallel broadcasts:
each broadcast is initiated by a root of an approved tree. Every message of the
procedure contains its broadcaster's identity; therefore, every node that receives
such a message can relate it to the appropriate invocation. Each procedure
invocation starts when its broadcaster, say Node u, updates in parallel all of its
neighbors in u's real tree. As a part of this update process u also sends its real
tree replica over Edge (u,v), over which it got the registration message. This is,
in fact, an Incremental Update: sending the whole of u's tree replica to a neighbor
v when Mirror_u(v) is empty.
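
The relation between a replica and a mirror can be sketched in Python as follows. This is only an illustration of the mirror idea, not the procedure of [AS91]: the function name incremental_update and the use of plain sets for the replica and for Mirror_u(v) are hypothetical.

```python
def incremental_update(replica, mirror):
    """Compute what u sends to v and bring Mirror_u(v) up to date.

    replica: set of topological items in u's tree replica.
    mirror:  set standing in for Mirror_u(v); modified in place.
    Returns (additions, deletions) as sorted lists.
    """
    additions = sorted(replica - mirror)
    deletions = sorted(mirror - replica)
    mirror.clear()
    mirror.update(replica)
    return additions, deletions


mirror = set()
# With an empty mirror the whole replica is sent.
first = incremental_update({(1, 2), (2, 3)}, mirror)
# Afterwards only the differences are sent.
second = incremental_update({(2, 3), (3, 4)}, mirror)
```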

Recall that the algorithm in [ACK90] required that u send the whole forest
replica, that is, also the replicas of trees to which u does not belong.
Transmitting the real tree replicas, rather than the forest replica, violates the
Neighbor Knowledge assumption: a node u no longer holds in its local memory,
in Mirror_u(v), an accurate copy of its neighbor v's local replica. Therefore, it is
possible that u will later transmit correction messages that describe topological
items that already appear in v's forest replica. Note that this sort of duplicate
message cannot be sent in the algorithm of [ACK90]. In Section 5.2 we prove
that these duplicates do not increase the order of the message complexity of our
algorithm.

As was mentioned earlier, the TREE MERGE procedure is a version of the
algorithm of [AS91]. However, weakening the Neighbor Knowledge assumption,
on which the algorithm in [AS91] relies, requires an additional algorithmic step
in our procedure whenever a node receives a correction message. Assume that
Node w receives a correction message from its neighbor, Node z, that tells w to
add an edge (x,y) to w's forest replica. Assume further that adding that edge
would connect two separate trees that appear in w's forest replica. The new
algorithmic step taken by w is to remove every edge that appears
in w's forest replica but does not appear in the mirror of Node z, Mirror_w(z).
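
The additional step can be sketched in Python under stated assumptions. The names reachable and on_add_correction are hypothetical, edges are represented as sorted tuples, and the sketch assumes that the new edge is added after the pruning; updating the mirror itself is left out.

```python
def reachable(edges, start):
    """Nodes connected to `start` through the given undirected edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, stack = {start}, [start]
    while stack:
        n = stack.pop()
        for m in adj.get(n, ()):
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return seen


def on_add_correction(forest, mirror_of_z, edge):
    """Handle a correction from z that tells w to add `edge`.

    forest:      set of edges in w's forest replica (sorted tuples).
    mirror_of_z: set of edges standing in for the mirror of z held by w.
    Returns w's new forest replica.
    """
    x, y = edge
    nodes = {n for e in forest for n in e}
    joins_two_trees = (x in nodes and y in nodes
                       and y not in reachable(forest, x))
    if joins_two_trees:
        # The new step: drop every edge that is not in the mirror of z.
        forest = forest & mirror_of_z
    # Assumption: the edge from the correction message is added afterwards.
    return forest | {tuple(sorted(edge))}


pruned = on_add_correction({(1, 2), (3, 4), (5, 6)}, {(1, 2)}, (2, 3))
kept = on_add_correction({(1, 2)}, set(), (2, 7))
```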

When a broadcaster u, a root of an approved tree, is notified that the TREE
MERGE procedure has terminated, u marks the Edge (u,v) as a real tree edge.
Node u also marks Node v as its parent. (This is done only if u is not Node r,
the node with the higher identity of the two endpoints of the core edge.)
Every node in every approved tree knows the number, its DOWN-FLOW counter,
of approved trees to which it and its ancestors in its real tree propagate the
registration message. If the DOWN-FLOW counter equals zero, then u sends a
termination message to its parent, notifying the parent that the TREE MERGE
procedure has terminated. A node w that has received as many termination
messages as its DOWN-FLOW counter sends a termination message to its parent.
When Node r is notified that all the TREE MERGE procedure invocations have
terminated, r starts a new iteration by invoking a new search for a
minimum outgoing edge for the new tree.
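
The counting of termination messages can be sketched as a sequential convergecast in Python. The sketch simplifies the description above: it treats each approved tree as a single participant, assumes that the DOWN-FLOW counter of a participant equals the number of approved trees that received the registration message directly through it, and assumes that every local TREE MERGE invocation has already terminated. The names convergecast, parent and down_flow are hypothetical.

```python
from collections import deque


def convergecast(parent, down_flow):
    """Return the termination messages in the order they are sent.

    parent:    dict participant -> parent, with None for Node r.
    down_flow: dict participant -> number of termination messages it
               must receive before reporting to its own parent.
    Each entry of the result is (sender, receiver); the final entry
    (r, None) marks the point where r knows that everything terminated.
    """
    received = {n: 0 for n in parent}
    ready = deque(sorted(n for n in parent if down_flow[n] == 0))
    sent = []
    while ready:
        n = ready.popleft()
        p = parent[n]
        sent.append((n, p))
        if p is None:
            continue
        received[p] += 1
        if received[p] == down_flow[p]:
            ready.append(p)
    return sent


parent = {"r": None, "a": "r", "b": "r", "c": "a"}
down_flow = {"r": 2, "a": 1, "b": 0, "c": 0}
sent = convergecast(parent, down_flow)
```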

5 Analysis

In this extended abstract we state the lemmas and theorems. The proofs are
deferred to the full paper. In our proofs we use the following reworded lemmas
and properties from the algorithm of [ACK90]:

Lemma 1. (Lemma C.1 [ACK90]) The FIND subroutine terminates.

Lemma 2. (Lemma C.2 [ACK90]) The UPDATE subroutine terminates.

Lemma 3. (Lemma C.3 [ACK90]) Upon the termination of UPDATE, the tree
replica at each node is a subset of the real tree to which the node belonged upon
invocation of UPDATE, and a superset of the real tree upon termination of
UPDATE.

In our proofs we use the transition system with asynchronous message passing
model of [Tel94]. In order to induce a notion of time in executions we use the
following causal order.

Definition 3. [Lam78] Let H be an execution. The relation <, called the causal
order, on the events of the execution is the smallest relation that satisfies:

(1) if h and f are different events in the same node and h occurs before f, then
h < f.

(2) if s is a send event and r the corresponding receive event, then s < r.

(3) < is transitive.
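
As an illustration of this definition, the following Python sketch computes the causal order of a small finite execution as a transitive closure. The function name causal_order and the event labels are hypothetical; each node's events are given as a list in local order, and messages are given as pairs of a send event and its receive event.

```python
def causal_order(local_orders, messages):
    """Return the set of pairs (a, b) such that a < b in the causal order.

    local_orders: dict node -> list of event labels in local order.
    messages:     iterable of (send_event, receive_event) pairs.
    """
    succ = {}
    for events in local_orders.values():
        for e in events:
            succ.setdefault(e, set())
        # Rule (1): consecutive events in a node; the rest follows by (3).
        for a, b in zip(events, events[1:]):
            succ[a].add(b)
    # Rule (2): a send event precedes its receive event.
    for s, r in messages:
        succ.setdefault(s, set()).add(r)
        succ.setdefault(r, set())
    # Rule (3): transitive closure by a search from every event.
    order = set()
    for e in succ:
        seen, stack = set(), list(succ[e])
        while stack:
            f = stack.pop()
            if f not in seen:
                seen.add(f)
                order.add((e, f))
                stack.extend(succ[f])
    return order


local_orders = {"p": ["p1", "p2"], "q": ["q1", "q2"]}
messages = [("p1", "q2")]
order = causal_order(local_orders, messages)
```

Here p2 and q1 are unrelated by the causal order, which is why the order induces only a partial notion of time.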

5.1 Correctness 

As long as the loop-freedom invariant is kept, the following three lemmas
follow:

Lemma 4. The real forest is indeed a forest throughout the algorithm's execution.

Lemma 5. The direction (parent pointer) of every real tree's edges induces only
one root in every real tree.

Lemma 6. The real trees are disjoint. 

Lemma 7. The algorithm’s first phase, the Trees Approval phase, preserves the 
loop-freedom invariant. 

Proof Sketch: During the first phase edges can only be deleted from a node's
replica. □

Lemma 8. The algorithm's second phase, the Trees Merge phase, preserves the
loop-freedom invariant.

Let u and v be two neighboring nodes in a real tree, and let T_u and T_v be the
tree replicas of nodes u and v, respectively.

Theorem 1. When the algorithm's second phase (the Trees Merge phase)
terminates, T_u and T_v are identical.

5.2 Amortized Communication Complexity

The following definition attempts to capture the "maximum information" a node
may have about the actual connected component it belongs to. That is, if a node
z receives a message that originated at some faraway node x, and every
node on the message's way (from x to z) forwarded the message before any fault
was detected, then x may still be in Node z's connected component. There is no
distributed algorithm that can detect that this is not the case. Thus the message
complexity of any algorithm must be based on the assumption that x and z may
still be in the same connected component.

Definition 4. Edge (x,y) is a z-virtual directed edge if (1) there is a causality
chain of send and receive events ([Lam78]) from Node x, through Node y,
to Node z: (x,y) = (w_1,w_2), (w_2,w_3), ..., (w_{l-1},w_l) = (w_{l-1},z), and (2)
between every receive event (on this chain) (w_{i-1},w_i) and the following (on this
chain) send event (w_i,w_{i+1}) there is no event at Node w_i of a failure of edge
(w_{i-1},w_i).



Definition 5. Node z's Virtual Component: the set of z-virtual directed edges.

Definition 6. Physical Component: the set of nodes that, from a global point
of view ([Lam78]) of the network, are in the same connected component.

Lemma 9. Every node that appears in Node u's tree replica is also in u's Virtual
Component.

Lemma 10. The size of u's tree replica is bounded from above by the size of u's
Virtual Component.

Proof Sketch: The lemma follows from Lemma 9. □

Let A_k be Node k's Virtual Component.

Lemma 11. For every execution there exists an equivalent execution ([Lam78])
in which Node k's physical component is A_k.

Theorem 2. Every algorithm that maintains, in a node, a replica of the node's
connected component (if the input stops changing, then the output, the node's
replica, is required to eventually become the node's connected component) cannot
maintain a smaller replica than our algorithm's tree replica.

From Theorem 2 it follows that expressing an algorithm's complexity measures as
a function of the virtual component size is the most accurate way to measure
a distributed algorithm.

Define A as the union of the tree replicas of all the nodes in the same physical
component when the last topological change occurs.

Theorem 3. Assume that k topological changes occur. Then the number of
edge identities exchanged during the algorithm's execution is O(kA).

5.3 Quiescence Time

Lemma 12. Every invocation of procedure TREE MERGE terminates. 

Lemma 13. Every edge failure is unmarked in the replica of at most one node
of a given tree.

Proof Sketch: An edge failure disconnects the real tree, and the edge's two
endpoints become nodes in distinct real trees. □

Lemma 14. The quiescence time of the algorithm's first phase, the Trees Approval
algorithm, is O(A).

Lemma 15. The quiescence time of the second phase of the algorithm, Trees
Merge, is O(A log^ A).

Theorem 4. The algorithm's quiescence time is O(A log^ A).
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